OCT-GAN: Single Step Shadow and Noise Removal from Optical Coherence Tomography Images of the Human Optic Nerve Head

Abstract: Speckle noise and retinal shadows within OCT B-scans occlude important edges, fine textures and deep tissues, preventing accurate and robust diagnosis by algorithms and clinicians. We developed a single process that successfully removed both noise and retinal shadows from unseen single-frame B-scans within 10.4 ms. Mean average gradient magnitude (AGM) for the proposed algorithm was 57.2% higher than current state-of-the-art, while mean peak signal to noise ratio (PSNR), contrast to noise


INTRODUCTION
Optical coherence tomography (OCT) is a well-established, noninvasive clinical imaging tool for in vivo viewing of cross-sectional images of optic nerve head (ONH) tissues with micrometer resolution [1]. Although there have been vast improvements in the resolution, speed, and depth of OCT imaging, some limitations remain. Since OCT uses coherent illumination, speckle noise is a major source of noise that degrades the image quality of OCT B-scans [2].
Speckle noise is a multiplicative noise inherent in coherence imaging and is caused by multiple forward and backward scattering of light waves. It frequently reduces contrast, and the grainy speckle pattern has been found to limit both the axial and lateral effective image resolution [3]. Subtle but important morphological details, such as individual tissue layers [4,5,6], are prevented from being identified and observed [7], making speckle noise detrimental to clinical diagnosis [8].
The most common speckle removal approach adopted in commercial OCT machines is B-scan averaging [9]. Although this technique can produce high quality images, the longer scan durations it requires cause other artifacts, such as registration errors [10] and motion artifacts [11], to appear on processed images due to eye or head motion [12]. The inability of elderly or young patients to remain fixated for long periods of time further makes it difficult to obtain 3D scans of the ONH [13] of relatively good quality with this technique.
Furthermore, multi-frame averaging does not prevent OCT signals obtained from locations beneath retinal blood vessels from being significantly diminished by scattering in the blood flowing through those vessels. This phenomenon produces artifacts in OCT images known as retinal shadows. These artifacts appear perpendicular to retinal layers, interrupting tissue layer continuity and causing errors in segmentation [14]. This in turn leads to inaccurate extraction of important structural metrics such as the thickness of the retinal nerve fiber layer (RNFL), which is important in glaucoma monitoring [15]. Retinal shadows also reduce the visibility of deep structures such as the anterior and posterior boundaries of the lamina cribrosa (LC), as weak, reflected signals from these structures are further attenuated by the lower incident light intensity within retinal shadows [16].
Recently, deep learning techniques have shown promise in reducing speckle noise. Mao et al. used a deep, fully convolutional encoding-decoding framework to suppress noise and perform super resolution analysis of input images [17]. Later in 2018, Ma et al. proposed an edge-sensitive generative adversarial network (GAN) to remove speckle noise from OCT images produced by commercial scanners [18]. Devalla et al. leveraged deep neural networks (DNNs), residual learning, and dilated convolutions to extract multi-scale features and contextual information to recover information lost due to speckle noise in OCT images of the ONH [19]. Many other works attempted to remove speckle noise with varying success, with a common recognition that speckle noise is a major factor degrading the quality of OCT images [20,21,22].
Some have attempted to remove retinal shadows as well. In 2011, Girard et al. developed two OCT modelling approaches to be used in conjunction, one to compensate for light attenuation and the other to enhance contrast in OCT images [16]. Later, in 2018, Vupparaboina et al. illustrated an improvement in choroid representation after shadow compensation [23]. Our more recent work [24] used a weighted custom loss function that removed shadows from multi-frame averaged images and illuminated faint features within retinal shadows. However, the above-mentioned algorithms require high quality images free from speckle noise and motion artifacts to function well, preventing users in possession of single-frame images and low-cost hardware from availing themselves of this technology.
Speckle noise, motion artifacts, and retinal shadows often interact and overlap, complicating processes that attempt to alleviate and remove these quality degrading phenomena [4,25]. Such attempts are often tedious and prone to errors, because multiple separate processes need to work together to remove each artifact individually, and the order in which artifacts are removed can cause issues for the other processes.
In this study, we aimed to develop an algorithm to remove both speckle noise and retinal shadows within a single step. By doing so, we will be able to reduce the cost of OCT devices by using simpler OCT imaging hardware enhanced by software.

METHODS
Patient Recruitment
Twenty-four healthy subjects (average age: 28 years) were recruited at the Singapore National Eye Centre (SNEC). All subjects gave written informed consent. This study adhered to the tenets of the Declaration of Helsinki and was approved by the institutional review board of the hospital. The inclusion criteria for healthy subjects were an intraocular pressure (IOP) of less than 21 mmHg and healthy optic nerves with a vertical cup-to-disc ratio of ≤ 0.5.

OCT Imaging
Recruited subjects were seated and imaged in dark room conditions by a single operator (TAT). A standard spectral domain OCT system (Spectralis; Heidelberg Engineering, Heidelberg, Germany) was used to image both eyes of each subject. Each volume contained 97 horizontal B-scans (32-µm distance between B-scans; 384 A-scans per B-scan) from a rectangular area of 15° × 10° centered on the ONH. We obtained a total of 2628 single-frame B-scans (noisy, without signal averaging) and 2628 multi-frame B-scans (clean, averaged over 75 frames). Enhanced depth imaging (EDI) [26] and eye tracking [27,28] modalities were used during the acquisition.

Overall description
Our algorithm was a single-step approach to removing both speckle noise and retinal blood vessel shadows simultaneously. It had two actively trained networks competing with one another. The first network, referred to as the shadow detector network, predicted which pixels would be considered shadowed pixels. The second network, referred to as the image processor network, aimed to remove shadows and speckle noise simultaneously from single-frame OCT images such that the shadow detector network could no longer identify shadowed pixels. Briefly, we trained the shadow detector network once on multi-frame images with added Gaussian noise, using their corresponding manually segmented shadow masks as the ground truth. We added Gaussian noise instead of speckle noise because training deep learning models to denoise B-scans with Gaussian noise provided empirically better results. Unfortunately, it is difficult to ascertain the reason for this phenomenon.
First, binary segmentation masks (size 496 × 384) were manually created for all 2328 training B-scans using ImageJ [29] by one observer (HC), where shadowed pixels were labelled as 1 and shadow-free pixels were labelled as 0. Next, we attempted to model single-frame images by creating "noisy" images. This was done by adding Gaussian noise to multi-frame images. Seven feature representations of each noisy image and its multi-frame counterpart were extracted using three pre-trained perceptual networks in order to train the image processor network to output multi-frame quality images from input noisy images. Finally, we trained the image processor network by passing the multi-frame image (with artificial Gaussian noise) as input and using the predicted binary masks as part of the loss function. More details about the overall algorithm can be found below (Fig. 1).

Shadow Detection Network and Image Processor Network Architecture
A neural network inspired by the UNet architecture [29] (Fig. 2) was trained with a simple binary cross entropy loss [30], using the noisy images (multi-frame + Gaussian noise) as inputs and the manually segmented masks as ground truths. This network had a sigmoid layer as its final activation, making it a per-pixel binary classifier. The shadow detector network first performed two convolutions with kernel size 3 and stride 1, followed by a ReLU activation [31] after each convolution. Then, images were downsampled with a 2×2 kernel, halving the height and width of the feature maps. This occurred four times, with the number of feature maps at each smaller size increasing from 1 to 64, 128, 256, and 512, respectively. The shadow detector network comprised two towers. A downsampling tower halved the dimensions of the input image (size 512 × 512) via maxpooling to capture contextual information such as the spatial arrangement of tissues, and an upsampling tower sequentially restored it to its original resolution to capture local information such as tissue texture [19]. Output images were then linearly scaled to values between 0 and 1 by subtracting the minimum value and dividing by the maximum value.
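The final rescaling step can be illustrated with a minimal sketch. It assumes the maximum is taken after the minimum has been subtracted (so the output spans exactly 0 to 1) and treats the image as a flat list of pixel values; the function name `min_max_scale` is hypothetical and not from the original work.

```python
def min_max_scale(pixels):
    """Linearly rescale a flat list of pixel values to the range [0, 1].

    Assumption: the maximum is taken after subtracting the minimum.
    """
    lowest = min(pixels)
    shifted = [p - lowest for p in pixels]
    highest = max(shifted)
    if highest == 0:
        # A constant image has no dynamic range; map it to zeros.
        return [0.0 for _ in shifted]
    return [s / highest for s in shifted]
```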

Image Augmentation
To ensure that our algorithm was robust and functioned on single-frame images with varying levels of noise and retinal shadows, we implemented online image rotation (-45° to 45°), XY translation (-50% to 50% of image size), image scaling (-50% to +50% of image size) and random horizontal flip during our training.
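As an illustration of online augmentation, the sketch below draws one set of augmentation parameters per training sample from the ranges stated above. The uniform sampling, the 50% flip probability, and the name `sample_augmentation` are assumptions made for illustration; the original implementation is not described at this level of detail.

```python
import random


def sample_augmentation(rng):
    """Draw one random set of augmentation parameters.

    Ranges follow the text; uniform sampling and the flip
    probability of 0.5 are assumptions.
    """
    return {
        "rotation_deg": rng.uniform(-45.0, 45.0),  # rotation in degrees
        "shift_x_frac": rng.uniform(-0.5, 0.5),    # fraction of image width
        "shift_y_frac": rng.uniform(-0.5, 0.5),    # fraction of image height
        "scale": rng.uniform(0.5, 1.5),            # -50% to +50% of image size
        "hflip": rng.random() < 0.5,               # random horizontal flip
    }
```

Because the parameters are redrawn every time an image is loaded, the network rarely sees exactly the same transformed input twice.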

Speckle Noise Modelling
We needed to add noise to multi-frame images to simulate the speckle noise found in single-frame images. The goal was to train the image processor network to remove this artificial noise and, in turn, to enable it to remove genuine speckle noise found in single-frame OCT images. We found through experiments that speckle noise could be modelled as Gaussian noise (µ = 0, σ = 1) multiplied by a sample from a uniform distribution (range 0.02 to 0.5). In addition, including a large range for the Gaussian model helped the algorithm to perform robustly on single-frame images, which had varying levels of noise. These numbers were obtained experimentally by qualitative assessment of test images generated from single-frame images. A new noise sample was created for every multi-frame image during training to encourage robust training of the image processor network.
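The noise model can be sketched as follows. This is a minimal illustration under several assumptions that the text does not state: pixel intensities lie in [0, 1], one uniform amplitude is drawn per image, the scaled Gaussian noise is added to the image, and the result is clipped back to [0, 1]. The name `add_modelled_noise` is hypothetical.

```python
import random


def add_modelled_noise(image, rng):
    """Add scaled Gaussian noise to an image given as a list of rows.

    Assumptions: intensities in [0, 1], one amplitude per image,
    additive noise, and clipping of the result to [0, 1].
    """
    amplitude = rng.uniform(0.02, 0.5)  # noise level for this image
    noisy = [
        [min(1.0, max(0.0, pixel + amplitude * rng.gauss(0.0, 1.0)))
         for pixel in row]
        for row in image
    ]
    return noisy, amplitude
```

Calling the function once per training image with a fresh draw reproduces the idea of a new noise sample for every multi-frame image.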

Feature Extraction
Because using mean squared error (MSE) directly on processed images as a loss function was found to produce blurring effects, we instead applied MSE to extracted feature representations of noisy images and their corresponding multi-frame B-scans. To extract comprehensive feature representations of each image, we required capable, pre-trained networks for feature extraction. Our framework consisted of three pre-trained and frozen feature extraction networks (Fig. 1). These frozen networks were used to extract features from input images and will henceforth be referred to as perceptual networks. We used three classification networks trained on ImageNet as our perceptual networks, namely EfficientNet-B4 [32], WideResnet101_2 [33], and Resnext101_32x8d [34]. We leveraged the "ensemble effect", whereby gradients were averaged from three different highly accurate perceptual networks to produce a more accurate backpropagation update [35] for the image processor network. High-level feature representations were extracted from the final convolutional layer of EfficientNet-B4, while both intermediate and high-level feature representations were extracted from residual blocks 2, 4, 6, and 8 of WideResnet101_2 and Resnext101_32x8d for computation of content and style losses. Each feature representation of a processed image was compared (using MSE) to the feature representation of its corresponding multi-frame image. These comparisons were then included in a custom loss function that we describe in the next section.

Loss Function for Training the Shadow Detector and Image Processor networks
We successfully trained the image processor network and simultaneously removed speckle noise and retinal shadows using a combination of different loss functions. These losses were:

Shadow Loss
The shadow loss was defined to ensure that all shadows were effectively removed so that they became indistinguishable from surrounding tissues. When a given image X had been processed, it was passed to the shadow detector network to produce a predicted shadow mask, M_X (with maximum pixel intensities equal to 1). All pixel intensities of M_X were then summed, and this sum was normalized by dividing it by the sum of the pixels within the manually segmented ground truth mask. This normalized sum was defined as the shadow loss.
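The shadow loss described above amounts to a ratio of two pixel sums. The sketch below is a plain-Python illustration on masks stored as lists of rows; the function name `shadow_loss` and the handling of an empty ground truth mask are assumptions.

```python
def shadow_loss(predicted_mask, ground_truth_mask):
    """Sum of predicted shadow probabilities divided by the
    number of manually labelled shadow pixels."""
    predicted_sum = sum(sum(row) for row in predicted_mask)
    ground_truth_sum = sum(sum(row) for row in ground_truth_mask)
    if ground_truth_sum == 0:
        # Assumption: an image without labelled shadows is rejected
        # rather than silently producing a division by zero.
        raise ValueError("ground truth mask contains no shadow pixels")
    return predicted_sum / ground_truth_sum
```

A value near 0 means the detector finds almost no residual shadow in the processed image, while a value near 1 means the detector still sees about as much shadow as was manually labelled.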

Content Loss
We used the content loss to ensure that critical information within all non-shadowed regions of a given image was retained after shadow correction. To compute the content loss, we compared intermediate and high-level feature representations between a given processed image D and its corresponding multi-frame image C. Note that the content loss has been used in Style Transfer [36] with great success at maintaining fine details and edges. We first applied the manually segmented shadow mask to the processed image and to its corresponding multi-frame image, giving D_masked and C_masked, respectively. This blocked out pixels in the retinal shadows so that the content loss would not be affected by any shadow removed. Next, we extracted feature representations from all perceptual networks for the processed image and its corresponding multi-frame image. The content loss was then defined as: where P_i is a feature representation from the ith selected residual block of a perceptual network. Note that i = 2, 4, 6, 8 for the WideResnet101_2 and Resnext101_32x8d perceptual networks, and i refers to the last convolutional layer for the EfficientNet-B4 perceptual network.
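The displayed equation for the content loss is not available in this text. One form consistent with the description (MSE between feature representations of the masked images, summed over the selected layers) is sketched below; the normalization by $N_i$, the number of elements in the ith feature representation, is an assumption.

```latex
\mathcal{L}_{\mathrm{content}}\left(D_{\mathrm{masked}}, C_{\mathrm{masked}}\right)
  = \sum_{i} \frac{1}{N_i}
    \left\lVert P_i\left(D_{\mathrm{masked}}\right) - P_i\left(C_{\mathrm{masked}}\right) \right\rVert_2^2
```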

Style Loss
To ensure that image textures remained the same in non-shadowed regions after shadow correction, we computed the style loss for the masked processed image, D_masked, and its corresponding multi-frame image, C_masked. To compute the style loss, we first calculated the Gram matrix of an image to find a representation of its style. Then, the style loss for each image pair (D_masked, C_masked) was defined to be the Euclidean norm between its Gram matrices: where G_i(x) is a C_i × C_i matrix defined as:
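A Gram matrix collects the inner products between every pair of feature channels. The sketch below shows the computation for a single layer, under two assumptions: each feature representation is given as C_i channels flattened to equal-length lists, and the Gram matrix is normalized by the total number of elements (the exact normalization used in the original work is not recoverable from this text). The function names are hypothetical.

```python
import math


def gram_matrix(features):
    """Return the C x C Gram matrix of a feature representation.

    features: list of C channels, each a flat list of activations.
    Assumption: entries are normalized by C * (activations per channel).
    """
    norm = len(features) * len(features[0])
    return [
        [sum(a * b for a, b in zip(chan_i, chan_j)) / norm
         for chan_j in features]
        for chan_i in features
    ]


def style_loss(features_processed, features_reference):
    """Euclidean (Frobenius) norm of the Gram matrix difference."""
    gram_p = gram_matrix(features_processed)
    gram_r = gram_matrix(features_reference)
    squared = sum(
        (p - r) ** 2
        for row_p, row_r in zip(gram_p, gram_r)
        for p, r in zip(row_p, row_r)
    )
    return math.sqrt(squared)
```

Because the Gram matrix discards spatial positions, the loss compares texture statistics rather than pixel-wise content.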

Total Loss
The total loss was computed as a weighted sum of the content, style, and shadow losses to ensure all losses were of the same order of magnitude. The shadow loss was set as the reference (as it was already normalized) and no weight was assigned to it. The total loss was defined as: where w_j and k_j are the weights to be derived experimentally; j summed over the type of perceptual network, i.e. EfficientNet-B4, WideResnet101_2, and Resnext101_32x8d. To obtain the weight values, we first trained the image processor network without style loss (k = 0) to determine all w. We then introduced all style losses and normalized them so that their magnitudes were on the same scale as the content losses.
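The displayed equation for the total loss is not available in this text. A form consistent with the description (an unweighted shadow loss plus content and style losses weighted per perceptual network j) would be the following; the exact arrangement of terms is an assumption.

```latex
\mathcal{L}_{\mathrm{total}}
  = \mathcal{L}_{\mathrm{shadow}}
  + \sum_{j} \left( w_j \, \mathcal{L}_{\mathrm{content}}^{(j)} + k_j \, \mathcal{L}_{\mathrm{style}}^{(j)} \right)
```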

Training Parameters
We used 2328 multi-frame averaged B-scans during training and 300 single-frame B-scans with their corresponding multi-frame B-scans during testing. These multi-frame images were used as the ground truth images for the content and style losses, but not for the shadow loss, as such images still contained shadows. Randomly generated Gaussian noise (created according to section F) was added to each B-scan.
During training, the image processor network learnt how to remove the randomly generated Gaussian noise using content and style losses, and it simultaneously learnt to remove retinal blood vessel shadows through the use of the shadow loss.
All training and testing were performed on five Nvidia GTX 1080 Ti cards with CUDA V10.1.105, paired with Nvidia driver V436.48 and cuDNN v7.6.5. Using these hardware specifications, each image took an average of 10.3 ms to be processed. The total training time was 4 days using the Adam optimizer at a learning rate of 1 × 10^-5 and a batch size of 6. A learning rate decay was implemented to halve the learning rate every 10 epochs. We stopped the training when no improvements in output images could be observed.
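The learning rate schedule can be written as a simple step decay. The sketch below assumes epochs are counted from 0 and that the halving happens at the start of epochs 10, 20, and so on; the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, base_lr=1e-5, decay_every=10, factor=0.5):
    """Step decay: multiply the base rate by `factor` once for every
    completed block of `decay_every` epochs (epochs counted from 0)."""
    return base_lr * factor ** (epoch // decay_every)
```

Under these assumptions, epochs 0 to 9 use 1 × 10^-5, epochs 10 to 19 use 5 × 10^-6, and epochs 20 to 29 use 2.5 × 10^-6.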

Noise and Retinal Shadow Removal Metrics
We used the average gradient magnitude (AGM), the peak signal-to-noise ratio (PSNR), the contrast-to-noise ratio (CNR) and the mean structural similarity index (SSIM) to quantify the noise removal capabilities of our proposed algorithm. All noise removal metrics were normalized with respect to their corresponding multi-frame image for easy comparison. All noise removal metrics were extracted from regions of interest (ROIs) that did not contain retinal shadows to prevent shadow removal from affecting noise removal metrics. We also used the intra-layer contrast (ILC) and the layer-wise pixel intensity (LPI) profiles to assess the proposed algorithm's effectiveness in removing shadows. During testing, we obtained all metrics on noisy, non-averaged single-frame B-scans. All multi-frame B-scans were then aligned to their corresponding single-frame B-scan with rigid translation/rotation transformations using 3D software (Amira, version 5.6; FEI) before noise and shadow removal metrics were extracted.

Noise Removal Quantitative Assessment
The AGM was used to quantify the sharpness of output images. We used the AGM implementation found in the Python package NumPy [37], defined as: where G(x, y), H and W were the gradient vector, height and width of the B-scan, respectively.
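A dependency-free sketch of the AGM is given below, assuming the usual definition: the gradient magnitude |G(x, y)| averaged over all H × W pixels. The finite differences mirror the default behaviour of `numpy.gradient` (central differences in the interior, one-sided differences at the borders); the function names are hypothetical.

```python
import math


def _gradient_1d(values):
    """Central differences inside, one-sided differences at the ends."""
    count = len(values)
    if count < 2:
        raise ValueError("need at least two samples to take a gradient")
    grad = [0.0] * count
    grad[0] = values[1] - values[0]
    grad[-1] = values[-1] - values[-2]
    for i in range(1, count - 1):
        grad[i] = (values[i + 1] - values[i - 1]) / 2.0
    return grad


def average_gradient_magnitude(image):
    """Mean gradient magnitude of an image given as a list of rows."""
    height, width = len(image), len(image[0])
    grad_x = [_gradient_1d(row) for row in image]  # indexed [y][x]
    grad_y = [_gradient_1d(list(col)) for col in zip(*image)]  # indexed [x][y]
    total = sum(
        math.hypot(grad_x[y][x], grad_y[x][y])
        for y in range(height)
        for x in range(width)
    )
    return total / (height * width)
```

A blurred image has weaker gradients at tissue boundaries and therefore a lower AGM than a sharp one.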
The PSNR (expressed in dB) was used to quantify the noise levels in an image relative to its true signal strength. We used the scikit-image implementation of PSNR defined as: where f_0 was the pixel intensity of the registered multi-frame B-scan, and f was the pixel intensity of the processed B-scan. A higher PSNR suggested that the processed images contained less noise and were of higher quality than images with lower PSNR.
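For reference, the PSNR is commonly computed as 10 log10(R² / MSE), where R is the dynamic range of the image and the MSE is taken between the reference and the processed image. The sketch below follows that definition in plain Python; the default `data_range` of 1.0 assumes intensities scaled to [0, 1], and the function name `psnr` is hypothetical.

```python
import math


def psnr(reference, processed, data_range=1.0):
    """Peak signal-to-noise ratio in dB between two images
    given as lists of rows of equal shape."""
    squared_errors = [
        (r - p) ** 2
        for ref_row, proc_row in zip(reference, processed)
        for r, p in zip(ref_row, proc_row)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(data_range ** 2 / mse)
```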
The CNR provided an indication of how visible a retinal tissue layer is. It was defined as: where µ_r, µ_b, σ_r^2 and σ_b^2 represented the means and variances of pixel intensities for a selected ROI within the tissue 'i' and a randomly chosen ROI from the background, respectively. Each ROI was chosen as a 20 × 384 pixels region at the top of the selected B-scan. A higher CNR suggested superior visibility of tissue 'i' within a given B-scan. We computed the CNR for the RNFL and compared it between single-frame, processed, and multi-frame B-scans. We computed the CNR as a mean of 25 randomly selected ROIs per tissue for each given B-scan, each of size 8 × 8 pixels. All ROIs were manually chosen in each tissue by an expert observer (HC) using a custom Python script using the OpenCV [38] package.
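The sketch below illustrates one common definition of the CNR, |µ_r − µ_b| / sqrt(0.5 (σ_r² + σ_b²)). It is assumed here for illustration only, since the displayed equation is not available in this text; the function name `cnr` is hypothetical and the ROIs are passed as flat lists of pixel intensities.

```python
import math
import statistics


def cnr(tissue_roi, background_roi):
    """Contrast-to-noise ratio between a tissue ROI and a background ROI.

    Assumed definition: |mu_r - mu_b| / sqrt(0.5 * (var_r + var_b)),
    using population variances.
    """
    mu_r = statistics.mean(tissue_roi)
    mu_b = statistics.mean(background_roi)
    var_r = statistics.pvariance(tissue_roi)
    var_b = statistics.pvariance(background_roi)
    return abs(mu_r - mu_b) / math.sqrt(0.5 * (var_r + var_b))
```

In practice the value would be averaged over several tissue ROIs per B-scan, as described above.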
The SSIM was computed to quantify changes in tissue structures (i.e., edges) between a given single-frame/processed image and its corresponding multi-frame image as a reference. The SSIM was based on the computation of three terms: luminance, contrast, and structure. We used the implementation of the SSIM in the scikit-image package in Python defined as: where µ_x, µ_y, σ_x, σ_y, and σ_xy were the local means, standard deviations, and cross-covariance for images x, y.
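For reference, the standard form of the SSIM, which combines the luminance, contrast, and structure terms into a single expression, is shown below. The constants $C_1 = (K_1 L)^2$ and $C_2 = (K_2 L)^2$ stabilize the division, with $L$ the dynamic range of the pixel values; scikit-image uses $K_1 = 0.01$ and $K_2 = 0.03$ by default. Whether the original work changed these defaults is not stated.

```latex
\mathrm{SSIM}(x, y)
  = \frac{\left(2 \mu_x \mu_y + C_1\right)\left(2 \sigma_{xy} + C_2\right)}
         {\left(\mu_x^2 + \mu_y^2 + C_1\right)\left(\sigma_x^2 + \sigma_y^2 + C_2\right)}
```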

Shadow Removal Quantitative Assessment
We computed the ILC to assess the performance of the proposed algorithm in removing shadows. The ILC was defined as: where I_1 was the mean pixel intensity from five manually selected ROIs (size 5 × 5 pixels) that are shadow free in a given retinal layer, and I_2 was the corresponding value from five neighboring shadowed regions of the same tissue layer. The ILC ranged between 0 and 1, where values close to 0 indicated the absence of retinal shadows and values close to 1 indicated strongly visible blood vessel shadows.
We computed the ILC for multiple tissue layers of the ONH region, namely the RNFL, the photoreceptor layer (PR) and the retinal pigment epithelium (RPE), before and after application of the proposed algorithm. Results for all metrics were recorded in the form of mean ± standard deviation.
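A contrast measure with the stated properties (0 when shadowed and shadow-free regions are equally bright, 1 when the shadowed region is completely dark) is the normalized difference |I_1 − I_2| / (I_1 + I_2). The sketch below assumes this form, since the displayed equation is not available in this text; the function name `intra_layer_contrast` is hypothetical.

```python
def intra_layer_contrast(shadow_free_mean, shadowed_mean):
    """Normalized intensity difference between shadow-free and
    shadowed regions of the same tissue layer (assumed definition).

    Both inputs are non-negative mean pixel intensities.
    """
    total = shadow_free_mean + shadowed_mean
    if total == 0:
        return 0.0  # both regions are black, so there is no contrast
    return abs(shadow_free_mean - shadowed_mean) / total
```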

RESULTS
When trained on 2328 multi-frame B-scans with online data augmentation, our deep learning framework was able to successfully remove noise and retinal shadows from unseen single-frame B-scans (Fig. 3). An independent test set of 300 single-frame B-scans was used to evaluate the noise and retinal shadow removal performance of the proposed deep learning framework qualitatively and quantitatively. The mean PSNR, CNR and SSIM increased with respect to the input single-frame B-scans from 18.5 ± 0.46 dB to 20.5 ± 0.38 dB, from 3.66 ± 0.92 to 8.97 ± 2.60, and from 0.177 ± 0.004 to 0.45 ± 0.09, respectively. The ILC for the RNFL, the PR, and the RPE decreased from 0.362 ± 0.133 to 0.142 ± 0.102, from 0.449 ± 0.116 to 0.090 ± 0.077, and from 0.381 ± 0.100 to 0.059 ± 0.045, respectively.

Deshadowing Performance -Quantitative Analysis
Our proposed algorithm produced images that had improved visibility within retinal shadows. The ILC for the RNFL, the PR, and the RPE improved by 60.0 ± 29.3%, 79.0 ± 19.4% and 83.4 ± 15.4%, respectively. On average, the ILC improved by 72.9 ± 25.2% (Fig. 6). The LPI profiles were also significantly flattened in the RNFL, the PR and the RPE layers (Fig. 7).

DISCUSSION
In this study we present a custom deep learning approach that can remove noise and retinal shadows simultaneously from single-frame OCT B-scans of the ONH. All noise removal performance metrics, such as the PSNR, the CNR, and the SSIM values, consistently showed significant improvements compared to single-frame images. Thus, we may be able to offer a robust deep learning framework to obtain high quality OCT B-scans with reduced scanning duration and minimized patient discomfort.
B-scans processed by our algorithm were qualitatively similar to their corresponding multi-frame B-scans, with the added benefit of improved visibility within retinal shadows (Fig. 3, Fig. 4). The SSIM and the CNR were significantly improved with respect to single-frame images by 154% and 187%, respectively. The mean AGM was also 57.2% higher than the current state-of-the-art [19], providing clinicians with a markedly sharper image. Image sharpness is critical given that many pathologies require sharp layer boundaries for accurate retinal layer thickness measurements. One example would be quantifying macular edema, which requires measurements of retinal thickness in response to therapy [39]. Given the significance of retinal layers and connective tissues in the prognosis and diagnosis of ocular pathologies such as glaucoma and age-related macular degeneration, enhanced visibility would improve automated algorithms downstream of the post processing pipeline, namely alignment, registration, segmentation, diagnosis and, ultimately, prognosis.
The proposed algorithm did not require any further segmentation, delineation, or identification of shadows by the user. Similar to our previous work [24], the ILC mean and standard deviation decreased with the depth of the retinal layer of interest (Fig. 6), suggesting that performance of the proposed algorithm was consistently better in deeper layers. The proposed algorithm substantially recovered the visibility of the anterior lamina cribrosa (LC) boundary and anterior LC insertion, which may result in a more confident prediction of early glaucoma [40]. Moreover, the main load bearing tissues of the eye in the ONH region, such as the LC and adjacent peripapillary sclera, could be monitored for pre-disease biomechanical and morphological changes. Changes in these tissues have been previously identified as risk factors for glaucoma [41]. Measurements of the anatomy of such tissues could be more robust and substantially improved after application of the proposed algorithm.
In this study, several limitations warrant further discussion. While we did not find any evidence of pathology being obscured or introduced into output images, it is extremely important to validate this in pathological cases. However, we would need to image the exact same tissue region with and without the presence of blood flow (to remove retinal blood vessel shadows). Such experiments would be extremely complex to carry out in vivo, especially in humans, even if blood vessels were to be flushed with saline during experiments. Such validations may be required for full clinical acceptance of this methodology. Furthermore, it would be critical to also confirm that the proposed algorithm would not interfere with another AI algorithm (especially those aimed at diagnosis and prognosis). Nevertheless, it is possible that the proposed algorithm might improve diagnosis and prognosis algorithms by improving the quality of the input data. We aim to test this hypothesis in the future. Furthermore, although the proposed algorithm functioned well on single-frame images from healthy individuals, more work is required to ensure that it can reproduce similar performance on B-scans of eyes with pathophysiological conditions such as glaucoma. This is especially critical for deep learning approaches, which respond unpredictably to input data that is different from images used during training. As this algorithm was trained on single-frame images from a Spectralis OCT device, it is unknown if it can maintain this performance on OCT images from other devices. Each scenario stated above may require a separate training set. Our future studies will therefore focus on validating the performance of the proposed algorithm across devices and between healthy and pathological eyes.

Figure 1: Overview of the proposed deep learning framework.

Figure 2: UNet Architecture used in the shadow detector network.

Figure 3: Samples of typical B-scans before (left) and after (right) being processed by our algorithm.

Figure 4: Qualitative analysis of the proposed noise removal and shadow removal algorithm. Blurring can be seen in the current state-of-the-art (second row) [19].