Speckle noise reduction in optical coherence tomography images based on edge-sensitive cGAN

Speckle noise in optical coherence tomography (OCT) impairs both the visual quality and the performance of automatic analysis. Edge preservation is an important issue for speckle reduction. In this paper, we propose an end-to-end framework for simultaneous speckle reduction and contrast enhancement for retinal OCT images based on the conditional generative adversarial network (cGAN). The edge loss function is added to the final objective so that the model is sensitive to the edge-related details. We also propose a novel method for obtaining clean images for training from outputs of commercial OCT scanners. The results show that the overall denoising performance of the proposed method is better than other traditional methods and deep learning methods. The proposed model also has good generalization ability and is capable of despeckling different types of retinal OCT images. © 2018 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

Though the main aim of OCT denoising is to reduce the grainy appearance in homogeneous areas, another important issue is preservation of image details, especially the edges, because edges are the most vital information needed for both visual inspection and automatic analysis such as segmentation. As the noise level is high in OCT images, many spatial filters tend to oversmooth the image, resulting in reduced contrast at the edges. Block matching based methods can result in edge distortions caused by disagreement of the edges in different blocks. Transform-based methods also tend to produce artifacts with the shape of transform basis near the edges.
Recently, deep learning has provided new ideas for image denoising. Mao et al. [17] proposed very deep convolutional encoder-decoder networks (RED-Net) with symmetric skip connections. Tai et al. [18] proposed a persistent memory network (MemNet). Zhang et al. [19] proposed residual learning with a deep convolutional neural network (DnCNN) for natural image denoising, in which the network is designed to predict the residual image from the noisy input. Cai et al. [20] borrowed the idea, improved the network structure with a residue module, and applied it to OCT image denoising. However, all these works rely on the additive noise assumption, whereas speckle is commonly modeled as multiplicative noise. Therefore, these models are not well suited to speckle noise in OCT images.
In this paper, we aim to remove speckle noise in Bscans from 3D OCT volumes exported from commercial retinal OCT scanners. Figure 1 shows the flowchart of the proposed method. We treat image denoising as an image-to-image translation problem, and propose a method based on the conditional generative adversarial network (cGAN) [21] to achieve the goal. Trained on pairs of noisy images and corresponding high quality images obtained by registration and averaging, and driven by the competition between the generator and the discriminator, the network is able to learn the underlying clean structures of retinas. To the best of our knowledge, this is the first time that an image-to-image cGAN has been applied to OCT speckle noise reduction. The contributions of our work are listed as follows.
· We introduce a new edge loss into the objective function of cGAN and make the network sensitive to the edge information, thus achieving good edge preservation while smoothing the homogeneous areas.
· We propose a method for obtaining high quality training images that works for common users of commercial OCT scanners.
· By preprocessing the training images, we make the deep network an end-to-end framework that achieves simultaneous speckle noise reduction and contrast enhancement.
· By data augmentation, we make the deep network capable of handling OCT images from both normal and pathological subjects, as well as data from different types of scanners.

Overview of conditional adversarial networks
Conditional adversarial networks have proven to be an effective image-to-image translation model for tasks such as label-to-photo conversion, colorization, and semantic segmentation [21]. Different from the original GAN, which learns a mapping from random noise to an output image, the cGAN generates the output image conditioned on an observed image. A cGAN consists of two modules with opposite goals: the generator G, which extracts features of the observed image x and produces the corresponding fake image y_fake, and the discriminator D, which classifies between real pairs (x, y_real) and fake pairs (x, y_fake). The model structure is illustrated in Fig. 2. In training, G tries to minimize the objective against D, which tries to maximize it, resulting in the following optimization:
$$G^{*} = \arg\min_{G}\max_{D}\; \mathcal{L}_{cGAN}(G, D), \qquad \mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x, y_{real}}\left[\log D(x, y_{real})\right] + \mathbb{E}_{x}\left[\log\left(1 - D(x, y_{fake})\right)\right] \tag{1}$$
where G* represents the resulting optimized generator. Previous approaches to cGAN have found it beneficial to combine the cGAN objective with a traditional loss, such as the L1 or L2 distance, so that the generated image is more similar to the ground truth. The L1 distance encourages errors that are sparsely distributed in space, while the L2 distance encourages errors that are uniformly distributed in space. Therefore, the L1 distance results in less blurring than the L2 distance [21].
By adding the L1 distance, the optimization becomes
$$G^{*} = \arg\min_{G}\max_{D}\; \mathcal{L}_{cGAN}(G, D) + \alpha\,\mathcal{L}_{L1}(G), \qquad \mathcal{L}_{L1}(G) = \mathbb{E}_{x, y_{real}}\left[\lVert y_{real} - y_{fake} \rVert_{1}\right] \tag{2}$$
where α is a weighting parameter.
In this paper, we further modify the objective function by adding a loss that is explicitly related to the edge information, to deal with the difficulty of preserving edges while despeckling. The edge loss, denoted L_edge(G), is computed from the intensity differences between longitudinally adjacent pixels (i, j) and (i+1, j) of the generated image and of the ground truth, where i and j represent coordinates in the longitudinal and lateral directions of the B-scan image. The edge loss measures the edge similarity between the generated image and the ground truth, and is inspired by the edge preservation index (EPI). As the retina has a layered structure, the longitudinal gradient is more important than the lateral one. Therefore, to keep the model simple, only the longitudinal gradient is used in calculating the edge loss. Thus the final optimization is performed as:
$$G^{*} = \arg\min_{G}\max_{D}\; \mathcal{L}_{cGAN}(G, D) + \alpha\,\mathcal{L}_{L1}(G) + \beta\,\mathcal{L}_{edge}(G) \tag{4}$$
where α and β are the weighting parameters.
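The generator-side terms of the objective can be illustrated with a small NumPy sketch. The explicit edge-loss formula is not legible in this copy of the text, so the form used here (the negative logarithm of an EPI-style ratio of summed longitudinal gradients) is an assumption made for illustration only. The function names are hypothetical, and the adversarial term is passed in as a plain number rather than computed from a discriminator.

```python
import numpy as np

def longitudinal_gradient_sum(img):
    """Sum of absolute differences between vertically adjacent pixels.

    Axis 0 is taken to be the longitudinal (depth) direction i,
    axis 1 the lateral direction j.
    """
    img = np.asarray(img, dtype=np.float64)
    return np.abs(img[1:, :] - img[:-1, :]).sum()

def l1_loss(generated, target):
    """Mean absolute difference between generated image and ground truth."""
    generated = np.asarray(generated, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    return np.mean(np.abs(generated - target))

def edge_loss(generated, target, eps=1e-8):
    """ASSUMED form: -log of the ratio of longitudinal gradient sums.

    The ratio equals 1 (loss 0) when both images carry the same total
    longitudinal edge strength; eps only guards against division by zero.
    """
    ratio = (longitudinal_gradient_sum(generated) + eps) / (
        longitudinal_gradient_sum(target) + eps)
    return -np.log(ratio)

def generator_objective(adv_loss, generated, target, alpha=100.0, beta=1.0):
    """Weighted sum: adversarial term + alpha * L1 + beta * edge term."""
    return (adv_loss
            + alpha * l1_loss(generated, target)
            + beta * edge_loss(generated, target))

# Toy layered "retina": one horizontal boundary between rows 1 and 2.
clean = np.array([[0., 0.], [0., 0.], [1., 1.], [1., 1.]])
low_contrast = 0.5 * clean  # same boundary, half the contrast
print(edge_loss(clean, clean), edge_loss(low_contrast, clean))
```

Under this assumed form, an output with weaker boundaries than the ground truth is penalized, while an output with stronger boundaries yields a negative value, so the term rewards edge contrast rather than exact edge agreement.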

Implementation of cGAN
In this paper, the "U-Net" [22], a kind of encoder-decoder structure with skip connections between symmetric layers in the encoder and decoder stacks, is used as the main framework of the generator, and PatchGAN, which identifies real or fake pairs based on patches in an image, is adopted as the discriminator architecture. Modules of the form Convolution-BatchNorm-ReLU [23] are the basic components of both generator and discriminator. Detailed structures are given in the Appendix. In our task of despeckling OCT images, the despeckled OCT image shares its structural information with the corresponding noisy OCT image, which requires that the structure of the generator's output image remain aligned with that of the input image. This is a mapping problem from a high resolution input grid to a high resolution output grid.
The symmetric skip connections of U-Net provide an effective solution to this problem: by combining low-level and high-level information, they help to produce a despeckled OCT image with more details.
In general, the discriminator in GANs outputs the probability that its input is real based on the whole image. Different from traditional discriminators, PatchGAN tries to identify whether each p × p patch in an image is real or fake. Such a discriminator regards the image as a Markov random field, assuming independence between pixels separated by more than a patch diameter. The discriminator is run convolutionally across the whole image, and the responses of all patches are averaged to obtain the final probability. One of the advantages of PatchGAN is that it can be applied to images of arbitrary size.
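The averaging step of a patch-based discriminator can be sketched in a few lines. This is an illustration only: the function names are hypothetical, and the per-patch values are assumed to be raw scores that are squashed by a sigmoid before averaging.

```python
import numpy as np

def sigmoid(x):
    """Map raw per-patch scores to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=np.float64)))

def image_probability(patch_scores):
    """Average the per-patch 'real' probabilities into one image-level value."""
    return float(np.mean(sigmoid(patch_scores)))

# A map of undecided patches (score 0 everywhere) averages to 0.5.
print(image_probability(np.zeros((4, 4))))
```

Because the result is a mean over whatever patch map the convolutions produce, the same discriminator can score inputs of different sizes.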
At training time, the Adam solver is applied to optimize the two adversarial networks. We adopt the standard approach of training GANs: optimization is carried out on the discriminator and the generator alternately [24]. At testing time, only the trained generator is used.

Ground truth for training
For deep network training, input pairs consisting of the original noisy image and a clean ground truth image are needed. However, for OCT image despeckling, there is no ground truth image readily available. As mentioned, good despeckling results can be obtained by averaging Bscans repeatedly acquired at the same location. Though commercial scanners such as Topcon DRI-1 offer such a scanning protocol, they only output the final high quality image. The averaging calculation is completed by the proprietary software, and the raw noisy image is not available to common users. Here we propose an alternative way of obtaining training image pairs that is practical for any user of a commercial OCT scanner. The high quality images are obtained through registration and averaging of Bscans from multiple OCT volumes.
M 3D OCT volumes are acquired repeatedly from the same normal eye, with minimal eye movement between acquisitions. One volume is randomly picked as the target and denoted as V_1, while the other volumes are denoted as V_2, ..., V_M. In this method, we assume that Bscans within a small range in a 3D volume share similar retinal structures. Registration is needed to remove possible structural misalignment caused by eye movement or slight differences in scanning location between scans. The MSSIM measure ensures that the best aligned images are averaged, to prevent blurring in the averaged results.
The registration is performed using the imregister function of MATLAB (MathWorks, version R2012a and later). It is a multi-resolution registration method based on pixel intensities. The transform parameters are optimized by minimizing the mean square error of pixel intensities between the target image and the transformed image, using the gradient descent method. An image pyramid is built in which the resolution decreases by a factor of 2 in each dimension from one level to the next. The parameters are first optimized at the coarsest level of the pyramid and then successively refined on the next level, until the original full resolution image is reached. In our experiments, the number of pyramid levels was set to 3. For gradient descent method, the m minimum step In our met piece-wise lin the mean of t 255].
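The coarse-to-fine structure of the registration can be sketched as follows. This is not the imregister implementation; it only shows a three-level pyramid in which each level halves the resolution in both dimensions, together with the mean square error that the optimizer is described as minimizing. The function names and the use of 2 × 2 block averaging for downsampling are assumptions.

```python
import numpy as np

def downsample_by_two(img):
    """Halve the resolution in each dimension by averaging 2 x 2 blocks."""
    rows, cols = img.shape
    rows -= rows % 2  # drop a trailing odd row or column, if any
    cols -= cols % 2
    img = img[:rows, :cols]
    return img.reshape(rows // 2, 2, cols // 2, 2).mean(axis=(1, 3))

def build_pyramid(img, levels=3):
    """Return the pyramid ordered from coarsest level to full resolution."""
    pyramid = [np.asarray(img, dtype=np.float64)]
    for _ in range(levels - 1):
        pyramid.append(downsample_by_two(pyramid[-1]))
    return pyramid[::-1]

def mean_squared_error(target, moving):
    """Intensity-based similarity measure between two same-size images."""
    diff = np.asarray(target, dtype=np.float64) - np.asarray(moving, dtype=np.float64)
    return float(np.mean(diff ** 2))

bscan = np.arange(128, dtype=np.float64).reshape(8, 16)
for level in build_pyramid(bscan, levels=3):
    print(level.shape)
```

A registration loop would estimate the transform on the first (coarsest) array and reuse it as the starting point on each finer level.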

Data
In our experim wavelength u acquired usin wavelength o 512 × 992 × repeated acqu ground truth, averaged ima blurred. The v 256 training B The trainin Japan) with c volume size w mm 3  All OCT data were uncompressed raw data exported from the scanners. For all acquisitions, we chose the 3D scanning mode with maximum number of Bscans provided by the scanner. In these modes, the output Bscans were the original acquisition, not averaged ones over several repetitions.
The study was approved by the Institutional Review Board of Soochow University, and informed consent was obtained from all subjects.

Implementation details
Data augmentation was used to allow the model to learn different characteristics of the testing data. Flipping in the lateral direction was used to simulate the symmetry of right and left eyes. Different scaling factors were applied to simulate the four types of pixel size (geometric size of the Bscan divided by the number of pixels in the corresponding dimension) of the testing data. Rotation was used to simulate different inclinations of the retina in the OCT image. Non-rigid transformation was used to simulate the deformation caused by pathologies. These operations were applied randomly, and the training data were augmented by a factor of two.
In the experiment, the Adam solver with an initial learning rate of 2e-4 and momentum 0.5 was applied to optimize the two adversarial networks. The weighting parameters were selected as α = 100 and β = 1, so that the L1 loss and the edge loss are of the same order of magnitude. In our tests, too large a weight for the edge loss could make the training difficult to converge. The batch size was set to 1 and the number of training epochs was set to 100. The proposed method was coded in Python based on TensorFlow and trained using an NVIDIA GTX Titan X GPU with 12 GB of memory.
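Two of the augmentations listed above, lateral flipping and rescaling, can be sketched with NumPy alone. The key point illustrated is that the noisy image and its ground truth must receive the same random transform so that the pair stays aligned. The function names, the scale values, and the nearest-neighbour resampling are assumptions for illustration; rotation and non-rigid deformation are omitted.

```python
import numpy as np

def flip_lateral(bscan):
    """Mirror the B-scan left-right (axis 1 is the lateral direction)."""
    return bscan[:, ::-1]

def rescale_nearest(bscan, factor_rows, factor_cols):
    """Resize by nearest-neighbour sampling with the given scale factors."""
    rows, cols = bscan.shape
    new_rows = max(1, int(round(rows * factor_rows)))
    new_cols = max(1, int(round(cols * factor_cols)))
    r_idx = np.minimum((np.arange(new_rows) / factor_rows).astype(int), rows - 1)
    c_idx = np.minimum((np.arange(new_cols) / factor_cols).astype(int), cols - 1)
    return bscan[np.ix_(r_idx, c_idx)]

def augment_pair(noisy, clean, rng, scales=(0.75, 1.0, 1.25)):
    """Apply one random flip and one random scale to BOTH images of a pair."""
    if rng.random() < 0.5:
        noisy, clean = flip_lateral(noisy), flip_lateral(clean)
    s = float(rng.choice(scales))
    return rescale_nearest(noisy, s, s), rescale_nearest(clean, s, s)

rng = np.random.default_rng(0)
noisy = np.arange(12, dtype=np.float64).reshape(3, 4)
clean = noisy + 1.0
aug_noisy, aug_clean = augment_pair(noisy, clean, rng)
print(aug_noisy.shape, aug_clean.shape)
```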

Evaluatio
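Signal-to-noise ratio (SNR) is calculated as:
$$SNR = 10\log_{10}\frac{\max(I)^{2}}{\sigma_b^{2}} \tag{6}$$
where max(I) represents the maximum pixel intensity of the image I, and σ_b is the standard deviation of the background region.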

Contrast-to-noise ratio (CNR)
CNR is a measure of the contrast between a signal region and the noisy background region in the image. CNR of the i-th signal region is calculated as:
$$CNR_i = 10\log_{10}\frac{\mu_i - \mu_b}{\sqrt{\sigma_i^{2} + \sigma_b^{2}}} \tag{7}$$
where μ_i and σ_i denote the mean and standard deviation of the i-th signal region in the image, while μ_b and σ_b denote the mean and standard deviation of the background region.
In our experiments, the average CNR is computed over the 3 signal ROIs.

Equivalent number of looks (ENL)
ENL is commonly used to measure the smoothness of a homogeneous region in the image. ENL over the i-th ROI in an image can be calculated as:
$$ENL_i = \frac{\mu_i^{2}}{\sigma_i^{2}} \tag{8}$$
where μ_i and σ_i denote the mean and standard deviation of the i-th signal ROI in the image.
In our experiments, the average ENL is computed over the 3 signal ROIs.
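The three region-based metrics can be computed as in the sketch below. It assumes the definitions commonly used in the OCT despeckling literature, SNR = 10 log10(max(I)^2 / σ_b^2), CNR_i = 10 log10((μ_i − μ_b) / sqrt(σ_i^2 + σ_b^2)) and ENL_i = μ_i^2 / σ_i^2, with population (not sample) standard deviations. The function names and the way ROIs are passed in as arrays are hypothetical.

```python
import numpy as np

def snr_db(image, background):
    """SNR in dB from the image maximum and the background variance."""
    return 10.0 * np.log10(np.max(image) ** 2 / np.var(background))

def cnr_db(signal_roi, background):
    """CNR in dB between one signal ROI and the background ROI.

    Only defined when the signal mean exceeds the background mean.
    """
    mu_i, sigma_i = np.mean(signal_roi), np.std(signal_roi)
    mu_b, sigma_b = np.mean(background), np.std(background)
    return 10.0 * np.log10((mu_i - mu_b) / np.sqrt(sigma_i ** 2 + sigma_b ** 2))

def enl(signal_roi):
    """Equivalent number of looks: squared mean over variance of the ROI."""
    return np.mean(signal_roi) ** 2 / np.var(signal_roi)

def average_over_rois(metric, rois, *args):
    """Average a per-ROI metric over several signal ROIs."""
    return float(np.mean([metric(roi, *args) for roi in rois]))

# Tiny hand-made regions: signal mean 9, background mean 2, both variance 1.
signal = np.array([8.0, 10.0])
background = np.array([1.0, 3.0])
image = np.concatenate([signal, background])
print(snr_db(image, background), cnr_db(signal, background), enl(signal))
```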

Edge preservation index (EPI)
EPI is a performance measure that reflects how well edge details are maintained in the image after denoising. EPI in the longitudinal direction is defined as:
$$EPI = \frac{\sum_{i}\sum_{j}\left|I_d(i+1, j) - I_d(i, j)\right|}{\sum_{i}\sum_{j}\left|I_o(i+1, j) - I_o(i, j)\right|} \tag{9}$$
where I_o and I_d represent the noisy image and the denoised image, while i and j represent coordinates in the longitudinal and lateral directions of the image. This index may not be an accurate indicator of edge preservation if calculated over the entire image, since after denoising the gradient becomes smaller in homogeneous regions. Therefore, we only calculate the sums in (9) in the neighborhood of image boundaries. In our experiments, the neighborhood was set as a band with a height of 7 pixels centered at the boundaries shown in Fig. 4.
Figure 5 shows denoised Bscans from the 9 test data obtained using training data 1, corresponding to those in Fig. 4. By visual inspection, we can see that the proposed edge-sensitive cGAN works well for data with different resolutions, obtained at different retinal locations, and for both normal and pathological retinas. The retinal structures are preserved well while speckle noise is suppressed. The contrast between layers is also enhanced. After denoising, the background is homogeneous and almost black. The highest pixel intensities occur at the RPE layer. For Bscans in which the choroid is also visualized, such as in Fig. 5(a)(b)(c)(f)(g), both the capillary and the large vessels can be observed much more clearly.
In order to further evaluate the effectiveness of the proposed edge-sensitive cGAN, we design three groups of comparative experiments. The first group is aimed at comparing different objective functions. The second group studies the performance achieved by different training data group is comp Fi

Performa
We first stud trained using (4), and the distance and e evaluation me testing Bscans Figure 6 s term only hav boundary loss edge loss, the Table 2

Performa
We trained tw metrics are c shown in Fig. in Fig. 7 and which can be ENL, the valu same center w testing data, p ENL values a obtained by t training set, c 6. Results for two esults using cGAN sensitive objective  Fig. 9(i)(j tissues inside than training scanner with learned from t

Compari
We compare denoising and and 3D filte (STROLLR) based K-SVD local statistica In these exper between spec to default valu the same trai comparison, t pixel intensiti Figure 8 Fig. 8(e) is oversmoothed, with many image details blurred, which explains the high SNR and ENL and also results in a low EPI. However, the result in Fig. 9(e) is under-smoothed, which might be caused by the difficulty of dictionary learning from the low quality image. The result of MAP in Fig. 8(f) is under-smoothed, while that in Fig. 9(f) is a bit oversmoothed. This might be caused by the unstable estimation of speckle parameters for different images. Moreover, the background noise is not removed well and the contrast is low. The results of DnCNN present vertical artifacts, and it almost fails for testing data 8. This shows the poor generalization ability of the network, probably due to limited training samples. The results of ResNet have distortions at the edges, and the contrast between layers is lower than that of the proposed method.
The proposed method with training data 1 obtains good denoising results for testing data 1. In particular, among all the methods compared, it best recovers the thin layer above the RPE complex, known as the external limiting membrane (ELM), which can be viewed more clearly in the zoomed cropped image. For testing data 8, the result is a bit blurred, but still better than those of the other methods.
In summary, combining both subjective and objective evaluation criteria, the proposed edge-sensitive cGAN obtains the best results among the methods compared. With training data 1, it improved the SNR, CNR and ENL by 87%, 116% and 285%, respectively, with respect to the original image. While many denoising methods reduce the EPI, it improved the EPI by 6%, which means that the edges are enhanced.
With training data 2, it improved the SNR, CNR and ENL by 127%, 120% and 265%, respectively, with respect to the original image. The mean EPI is comparable to that of the original image, indicating that the edges are mostly preserved.

Discussion and conclusions
In this paper, we propose an end-to-end deep learning framework that achieves simultaneous speckle reduction and contrast enhancement for retinal OCT images. The method is based on the image-to-image cGAN structure with a new edge-sensitive objective function. Unlike previous deep networks proposed for denoising [17][18][19][20], which try to estimate the noise residue from the noisy input, the cGAN learns the mapping from the noisy image to the clean image through the competition between generator and discriminator, and thus is not limited by the additive noise assumption. By introducing the edge loss function, the method achieves a balanced performance in speckle reduction and structure preservation. A novel method is proposed for obtaining the ground truth images based on multiple volumetric scans of the same eye, which is easy to implement for users of commercial scanners. Thr paired with th trained deep n denoising. In augmentation samples. clinicians. Secondly, as OCT despeckling acts as the preprocessing step for automatic OCT image analysis, we will study how the performance of tasks such as segmentation is improved by the proposed despeckling method. We do not list the time cost of each compared method here, because it is unfair to compare methods run on a CPU, some with MATLAB code that is not optimized for efficiency, with deep learning methods run on a GPU. Still, the testing stage of deep learning methods has proven to be very fast. The proposed method requires an average of only 0.22 seconds to denoise one Bscan, which can readily meet the real-time demand of clinical practice.
In conclusion, we have proposed an efficient and effective method for speckle noise reduction in 3D OCT volumes exported from commercial retinal OCT scanners. The method achieves speckle noise suppression, edge preservation and contrast enhancement simultaneously. This method can also be extended to the enhancement of other medical imaging modalities, such as ultrasound and low-dose CT.

Appendix: Detailed structures of cGAN
The overall structure of the U-shaped generator is illustrated in Fig. 10. It is a kind of encoder-decoder structure with symmetric skip connections. All convolution and deconvolution layers apply 4×4 spatial filters with stride 2. Each layer adopts BatchNorm except the first convolutional layer of the encoder. All ReLUs in the encoder are leaky with slope 0.2, while those in the decoder are not. The random noise z is implicitly implemented as dropout with rate 0.5, i.e., randomly dropping outputs with probability 0.5, in the first three layers of the decoder. The dropout also helps to prevent overfitting during training. Tanh is used as the activation function of the last layer in the decoder.
The discriminator architecture, called PatchGAN, is shown in Fig. 11. PatchGAN takes real pairs or fake pairs as input and produces the corresponding outputs. It has five convolution layers. All ReLUs in the first four layers are leaky with slope 0.2. The middle three layers adopt BatchNorm. All layers apply 4×4 spatial filters, with stride 2 in the first three layers and stride 1 in the last two layers. For this design, the size of PatchGAN's receptive field, i.e., the patch size p, is 70, which makes PatchGAN have fewer parameters and run faster than traditional discriminators while still producing high quality results [21]. Sigmoid is used as the activation function of the last layer to perform the identification. In the final 62×62 image, each pixel represents the probability that the corresponding 70×70 patch in the input is identified as real.
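The stated patch size of 70 follows from the layer configuration and can be checked with a short calculation. The sketch below accumulates the receptive field over five 4×4 convolutions with strides 2, 2, 2, 1, 1. The output-size helper additionally assumes a padding of 1 pixel per side in every layer and a 512×512 input; neither value is stated in the text, but under those assumptions the output map is 62×62.

```python
def receptive_field(layers):
    """Receptive field of a conv stack given (kernel, stride) per layer."""
    rf, jump = 1, 1  # jump = spacing of this layer's units in input pixels
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

def output_size(n, layers, padding=1):
    """Spatial size after the stack; padding per side is an ASSUMPTION."""
    for kernel, stride in layers:
        n = (n + 2 * padding - kernel) // stride + 1
    return n

# Five 4x4 convolutions: stride 2 in the first three, stride 1 in the last two.
PATCHGAN_LAYERS = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]

print(receptive_field(PATCHGAN_LAYERS))   # 70
print(output_size(512, PATCHGAN_LAYERS))  # 62 (assumed 512-pixel input)
```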