Real-time noise reduction based on ground-truth-free deep learning for optical coherence tomography.

Optical coherence tomography (OCT) is a high-resolution, non-invasive 3D imaging modality that has been widely used in biomedical research and clinical studies. Noise on OCT images is inevitable and causes problems for post-processing and diagnosis. The frame-averaging technique, which acquires multiple OCT images at the same or adjacent locations, can enhance image quality significantly. Both conventional frame-averaging methods and deep learning-based methods that use averaged frames as ground truth have been reported. However, conventional averaging methods suffer from long image acquisition times, while deep learning-based methods require complicated and tedious ground truth label preparation. In this work, we report a deep learning-based noise reduction method that does not require clean images as ground truth for model training. Three network structures, including Unet, the super-resolution residual network (SRResNet), and our modified asymmetric convolution SRResNet (AC-SRResNet), were trained and evaluated using signal-to-noise ratio (SNR), contrast-to-noise ratio (CNR), edge preservation index (EPI), and computation time (CT). The effectiveness of the three trained models on OCT images of different samples and from different systems was also investigated and confirmed. The SNR improvements on images of different samples for the L2-loss-trained Unet, SRResNet, and AC-SRResNet are 20.83 dB, 24.88 dB, and 22.19 dB, respectively. The SNR improvements on public images from a different system for the L1-loss-trained Unet, SRResNet, and AC-SRResNet are 19.36 dB, 20.11 dB, and 22.15 dB, respectively. AC-SRResNet and SRResNet demonstrate a better denoising effect than Unet at the cost of longer computation time. AC-SRResNet demonstrates better edge preservation capability than SRResNet, with Unet close to AC-SRResNet.
Eventually, we incorporated Unet, SRResNet, and AC-SRResNet into our graphics processing unit (GPU)-accelerated OCT imaging system for online noise reduction evaluation. Real-time noise reduction of 512×512-pixel OCT images was achieved at 64 fps, 19 fps, and 17 fps for Unet, SRResNet, and AC-SRResNet, respectively.


Introduction
Optical coherence tomography (OCT) imaging has been widely used in the field of medical diagnosis due to its advantages of non-invasiveness, high sensitivity and high resolution [1][2][3][4]. However, noise such as thermal noise and shot noise of the detectors, as well as the inherent speckle noise, is inevitably generated during the imaging process. The presence of noise decreases the contrast and resolution of OCT images, degrading image quality and causing issues for diagnosis. At the same time, noise can also affect post-processing of OCT images, such as image segmentation [5][6][7]. Therefore, noise reduction has always been an urgent problem in OCT imaging and one of its hot research topics.
Traditional OCT image noise reduction methods can be divided mainly into hardware and software categories. The hardware methods are mainly frequency compounding and spatial compounding (multi-frame averaging) [8][9][10][11][12][13][14][15]. Although multi-frame averaging has been proven effective in reducing noise, acquiring B-scan images at the same position multiple times requires both a long scan time and a long time for the patient to remain stationary, and it also depends on the accuracy of registration algorithms [14]. This is quite a challenge for patients, especially the elderly and children, and can cause a certain degree of discomfort [14]. The software methods rely on post-processing algorithms such as non-local means (NLM) filtering and the block-matching and 3D (BM3D) filtering algorithm [16,17]. However, these conventional noise reduction methods inevitably destroy image details, reduce the contrast at the edges of OCT images, and degrade image quality. Some also suffer from long processing times [18,19], making it difficult for them to meet clinical real-time noise reduction requirements.
Recently, deep learning has been widely used in the field of OCT image denoising. The development of convolutional neural networks (CNNs) has also shown great potential in recent years [20][21][22][23][24][25]. A CNN can effectively extract image information from a large number of training samples, so it is widely used in the field of noise reduction. Ma et al. used a conditional generative adversarial network (cGAN) to reduce noise in retinal OCT images; this method outperformed other traditional methods in both performance and generalization ability [21]. Qiu et al. used a convolutional network with a perceptually-sensitive loss function to denoise OCT images, and this method was shown to be superior to NLM and BM3D in preserving image details [24].
Compared with traditional algorithms, deep learning noise reduction methods have shown promising improvements in image quality, especially in preserving image edge details. To train these methods, it is necessary to prepare clean ground truth images corresponding to the noisy images as labels [26,27]. However, it is very difficult to obtain a noisy image together with its corresponding clean ground truth. In OCT imaging, it is often necessary to acquire multiple B-scan frames at the same location and then register and average them to obtain the ground truth, which is a complicated process.
Based on the deep learning method named Noise2Noise, we propose a deep learning noise reduction method for OCT images that does not require noise-free ground truth as labels [28,29]. With this method, we only need to acquire any two B-scan OCT images at the same sample location, taking one noisy image as the input and the other as the label. The underlying principle is that the noise in two different OCT images should differ while the true sample structure remains the same. Three network structures, including Unet and the super-resolution residual network (SRResNet) as used in previous work [28,29], and our modified asymmetric convolution super-resolution residual network (AC-SRResNet), were trained and evaluated using signal-to-noise ratio (SNR), contrast-to-noise ratio (CNR) and edge preservation index (EPI). The effectiveness of these three trained models on OCT images of different samples and from different systems was investigated and compared with the traditional BM3D method. Eventually, we incorporated the three models into our graphics processing unit (GPU)-accelerated OCT imaging system for online noise reduction evaluation. To the best of our knowledge, no implementation and evaluation of deep learning noise reduction methods for online real-time OCT imaging has been reported to date.
The remainder of this paper is organized as follows. Details of our method are illustrated in Section 2. Experimental results and comparison with traditional methods on different samples and systems in addition to real-time performance evaluation are presented and discussed in detail in Section 3. Finally, the main conclusions are presented in Section 4.

Methods
Detailed information on the Noise2Noise method can be found in [28,29]. When using a convolutional neural network to address the image noise reduction problem, if the input image is x and the target image is y, the noise reduction problem can be regarded as the following parameter optimization problem:

argmin_θ E_{(x,y)} { L( f_θ(x), y ) },   (1)

where L is the loss function, E is the expectation over the observations, f_θ(x) is the network function, and θ denotes the network parameters. If the entire training task is decomposed into the same minimization problem at every training sample, then according to Bayes' theorem, Eq. (1) is equivalent to:

argmin_θ E_x { E_{y|x} [ L( f_θ(x), y ) ] }.   (2)

If both the input and the label are corrupted with noise, the objective function of the network becomes:

argmin_θ Σ_i L( f_θ(x̂_i), ŷ_i ),   (3)

where x̂_i = x_i + σ′ and ŷ_i = y_i + σ′′ form the i-th noisy image pair at the same position, x_i is the underlying clean input, y_i is the unobserved clean label, and σ′ and σ′′ represent different noise realizations drawn from the same underlying distribution. As demonstrated in [29], minimizing Eq. (3) results in f_θ(x̂_i) = x_i given sufficient data, since E{ŷ_i | x̂_i} = y_i. Therefore, as long as the number of samples is large enough, even when both the input and label images are noisy, the output image will be a noise-reduced image. It is worth mentioning that the speckle patterns inherent in OCT images, which are caused by the microstructure of the samples, should be considered structural signals instead of random noise, because the speckle patterns of static samples do not change between frames.
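The expectation argument above can be illustrated with a toy numpy sketch (not the authors' training code): for an L2 loss, the estimate that minimizes the loss against many noisy targets is their mean, which converges to the clean value even though no clean label is ever observed. The constant "image" and noise levels here are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy clean "image": a constant patch we only ever observe through noise.
clean = np.full((64, 64), 5.0)
noisy_inputs = clean + rng.normal(0.0, 1.0, size=(1000, 64, 64))
noisy_targets = clean + rng.normal(0.0, 1.0, size=(1000, 64, 64))

# The per-pixel minimizer of the L2 loss over many noisy targets is their
# mean, which approaches the clean value as the sample count grows.
l2_estimate = noisy_targets.mean(axis=0)

# The estimate is far closer to the clean image than any single noisy frame.
err_single = np.abs(noisy_inputs[0] - clean).mean()
err_estimate = np.abs(l2_estimate - clean).mean()
```

With 1000 noisy targets the residual error of the mean estimate shrinks roughly with the square root of the sample count, which is why "sufficient data" matters in the argument above.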

Data sources
We obtained OCT images of various samples, including finger nails, hand palms, tomato, a sample tooth, plastic tubes and thin films, to form the data sources. The OCT system we used is a home-built spectral domain OCT imaging system with the following parameters: the central wavelength of the light source is 1300 nm, the A-scan rate is 70 kHz, the axial resolution is 14 µm, the imaging depth is 6.7 mm, and the system sensitivity is 92 dB. We collected 15 sets of OCT images of different samples: 10 sets, each consisting of 250 B-mode images, were used as the training data set, and the other 5 sets, each a C-mode volume with 250 B-frames, were used as the test data set. The training data set consists of B-mode images of static samples to meet the requirement of images at the same position; in addition, no registration or matching operation is necessary since the samples are static.
The original size of an acquired image is 1000×1024 pixels. Considering training speed and the limitation of GPU memory size, we adjusted the size of the input image and label to 256×256 pixels. Data augmentation was performed by randomly cropping each image into 12 sub-images that were then resized, which left us with 120 training data sets.
Unlike other deep learning training methods, time-consuming pairing of input images and labels in the training data set is not necessary here. We randomly select one data set from the 120 training data sets, and then randomly select two noisy images from its 250 B-scan OCT images. One of the two noisy images is used as the input image, while the other is used as the label. Any of the 250 noisy images in a data set can serve as either an input image or a label.
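The pairing step above can be sketched in a few lines of numpy. This is an illustrative stand-in, not the authors' pipeline: the data set list, array shapes (shrunk from 256×256 to keep the sketch light), and function name are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in for the augmented data sets: each entry holds 250
# co-located noisy B-scans (shapes shrunk for the sketch; the paper uses
# 120 sets of 256x256 images).
datasets = [np.zeros((250, 64, 64), dtype=np.float32) for _ in range(4)]

def sample_noisy_pair(datasets, rng):
    """Draw one (input, label) pair: two different noisy frames from the
    same data set, so both share the structure but carry independent noise."""
    d = rng.integers(len(datasets))
    i, j = rng.choice(250, size=2, replace=False)  # two distinct frames
    return datasets[d][i], datasets[d][j]

x, y = sample_noisy_pair(datasets, rng)
```

Because any frame can serve as input or label, the number of usable training pairs per data set is far larger than the number of frames, which is one practical benefit of the Noise2Noise setup.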

Network architectures
In this work, we chose Unet, SRResNet and AC-SRResNet as our feature extraction networks for comparison, since they have demonstrated powerful capabilities in image feature extraction and have been widely used in image segmentation [30] and noise reduction [31]. Unet is a typical lightweight CNN, while SRResNet and AC-SRResNet are relatively deeper and more powerful at feature extraction.
The architecture of our proposed AC-SRResNet is shown in Fig. 1. The 3×3 convolution kernels in SRResNet were replaced by asymmetric convolution blocks. An asymmetric convolutional network architecture, which replaces the convolution kernels in a CNN with an asymmetric structure, can improve deep learning accuracy [32]. We replaced each 3×3 convolution kernel with parallel layers of 3×3, 1×3, and 3×1 convolution kernels to improve accuracy. For small batches, i.e., a batch size of less than 32, batch normalization can become less effective. Limited by GPU memory, our batch size was set to 4, which falls in the small-batch regime. Therefore, we used batch renormalization instead of batch normalization to ensure the effectiveness of normalization [33].
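A key property of the asymmetric convolution block [32] is that, because convolution is linear in the kernel, the parallel 3×3, 1×3 and 3×1 branches can be fused into a single 3×3 kernel at inference time. The numpy sketch below demonstrates this additivity on a plain single-channel "valid" convolution; it is an illustration of the principle, not the authors' layer implementation (no normalization, single channel, random kernels).

```python
import numpy as np

def conv2d_valid(img, k):
    """Plain 'valid' 2D cross-correlation, enough for this sketch."""
    kh, kw = k.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
img = rng.normal(size=(16, 16))
k33 = rng.normal(size=(3, 3))
k13 = rng.normal(size=(1, 3))
k31 = rng.normal(size=(3, 1))

# Embed the asymmetric kernels in 3x3 frames (zeros elsewhere), which is how
# the parallel branches align spatially under same-centered padding.
k13_p = np.zeros((3, 3)); k13_p[1, :] = k13[0]
k31_p = np.zeros((3, 3)); k31_p[:, 1] = k31[:, 0]

branch_sum = (conv2d_valid(img, k33) + conv2d_valid(img, k13_p)
              + conv2d_valid(img, k31_p))
fused = conv2d_valid(img, k33 + k13_p + k31_p)
```

The two results agree exactly, so the extra branches add training-time capacity without any inference-time cost beyond a kernel addition.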

Training procedure
Training parameters were tuned empirically. We chose the Adam optimizer with a learning rate l_r of 0.005 and a learning rate decay of 1.67×10⁻⁵. The maximal number of iterations was set to 300, and the model with the minimum loss value over the 300 iterations was saved for further evaluation. Both L1 and L2 loss functions were adopted for testing in our study, defined as:

L1 = (1 / (m n)) Σ_{i=1}^{m} Σ_{j=1}^{n} | I_n(i, j) − I_d(i, j) |,   (4)

L2 = (1 / (m n)) Σ_{i=1}^{m} Σ_{j=1}^{n} ( I_n(i, j) − I_d(i, j) )²,   (5)

where m and n are the image dimensions (both 256 in our study), and I_n(i, j) and I_d(i, j) are the gray values of the output image and the label, respectively. Generally, the L2 loss function is used for zero-mean noise, such as additive Gaussian and Poisson noise, while the L1 loss function seeks to recover the median of the targets [29].
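The two loss functions can be written directly in numpy; the mean-normalized forms below match the definitions above (the normalization by m·n is an assumption consistent with the reconstructed equations).

```python
import numpy as np

def l1_loss(I_n, I_d):
    """L1 loss: mean absolute difference over an m-by-n image pair."""
    return np.mean(np.abs(I_n - I_d))

def l2_loss(I_n, I_d):
    """L2 loss: mean squared difference over an m-by-n image pair."""
    return np.mean((I_n - I_d) ** 2)

# Tiny sanity check on constant images differing by 2 gray levels.
a = np.zeros((256, 256))
b = np.full((256, 256), 2.0)
```

On these constant images the L1 loss evaluates to 2 and the L2 loss to 4, reflecting how the L2 loss penalizes large deviations more heavily, which drives its mean-seeking behavior.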

Quantitative evaluation
We adopted signal-to-noise ratio (SNR), contrast-to-noise ratio (CNR) and edge preservation index (EPI) as objective comparison metrics. SNR is the most common and widely used image evaluation index in the field of noise reduction; in general, the larger the SNR, the better the noise reduction effect. It is defined as follows:

SNR = 10 log₁₀ ( Σ_{i=1}^{M} Σ_{j=1}^{N} I_o(i, j)² / (M N σ_b²) ),   (6)

where I_o is the value of the object region, σ_b is the standard deviation of the noisy background region, and M and N are the ROI height and width, respectively. CNR effectively evaluates the contrast between the object region and the noisy background region. It is defined as:

CNR = 10 log₁₀ ( | μ_o − μ_b | / σ_b ),   (7)

where μ_o and μ_b are the average values of the object region and the noisy background region, respectively, and σ_b is the standard deviation of the noisy background region.
EPI effectively evaluates how well the edge details of a noise-reduced image are preserved. It can be defined as:

EPI = Σ_{i=1}^{M−1} Σ_{j=1}^{N} | I_d(i+1, j) − I_d(i, j) | / Σ_{i=1}^{M−1} Σ_{j=1}^{N} | I_n(i+1, j) − I_n(i, j) |,   (8)

where I_d and I_n are the pixel values of the noise-reduced image and the noisy image, and M and N are the ROI height and width, respectively. It is worth mentioning that this definition of EPI captures edges in the horizontal direction, as the images we test have clear and abundant horizontal edge features.
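The three metrics can be sketched in numpy as below. The exact normalizations are assumptions consistent with the definitions above (the paper's original equation layout was garbled), so treat these as illustrative forms rather than the authors' verified code.

```python
import numpy as np

def snr_db(object_roi, background_roi):
    """SNR in dB from an object ROI and a noise-only background ROI
    (form assumed from the definitions in the text)."""
    sigma_b = background_roi.std()
    return 10.0 * np.log10(np.mean(object_roi ** 2) / sigma_b ** 2)

def cnr_db(object_roi, background_roi):
    """CNR in dB between object and background ROIs (assumed form)."""
    mu_o, mu_b = object_roi.mean(), background_roi.mean()
    return 10.0 * np.log10(abs(mu_o - mu_b) / background_roi.std())

def epi(denoised, noisy):
    """Ratio of gradient magnitudes along the depth axis, so horizontal
    edges are captured; an unprocessed image scores exactly 1."""
    g_d = np.abs(np.diff(denoised, axis=0)).sum()
    g_n = np.abs(np.diff(noisy, axis=0)).sum()
    return g_d / g_n

rng = np.random.default_rng(0)
obj = np.full((10, 10), 100.0)                  # bright object ROI
bg = rng.normal(0.0, 1.0, size=(10, 10))        # noisy background ROI
img = rng.normal(size=(32, 32))                 # arbitrary test image
```

Note that EPI rewards not touching the image at all (a value of 1), which is why it must always be read alongside SNR and CNR when comparing denoisers.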

Real-time online imaging system software architecture
The deep learning-based noise reduction models were integrated into our GPU-accelerated customized OCT imaging software platform to evaluate their real-time online performance. The architecture of the platform is shown in Fig. 2. It was developed using Visual Studio 2015 with Qt for the graphical user interface, and contains four separate threads: an acquisition thread, a CUDA processing thread, an image plotting thread and a Tensorflow C-API image denoising thread. Thread communication and synchronization were achieved using the Qt signal and slot mechanism.
When the imaging process starts, the acquisition thread acquires raw B-mode data from the spectrum data pool. Here we control the overall imaging speed by adjusting how fast the raw B-mode data is acquired. Once the raw B-mode data is ready, the acquisition thread emits a signal to the CUDA processing thread to transfer the data into GPU memory for OCT signal processing, including wavelength-to-wavenumber interpolation, reference subtraction, FFT, and image magnitude and log mapping to form the raw B-mode image data. Once the raw B-mode image data is ready, the CUDA processing thread emits one signal to the plotting thread to display the processed image and a second signal to the Tensorflow C-API image denoising thread to transfer the B-mode image data into pre-allocated deep learning processing memory for noise reduction and denoised image display. It should be pointed out that, since Tensorflow C-API image denoising and CUDA OCT signal processing share the same GPU memory, we specifically allocated 40% of the GPU memory for deep learning processing, with the rest available for CUDA processing whenever needed. Since the Tensorflow C-API operates on tensors for image denoising, the unsigned char image data processed by CUDA had to be converted to a tensor.
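The unsigned-char-to-tensor conversion mentioned above amounts to a dtype cast, scaling, and adding batch and channel dimensions. The numpy sketch below shows the shape transformation; the NHWC layout and [0, 1] scaling are assumptions (the actual system does this in C++ against the Tensorflow C-API), and the ramp data is a placeholder for a real B-mode image.

```python
import numpy as np

# Placeholder for the 8-bit 512x512 B-mode image handed over by the CUDA
# pipeline (values here are a synthetic ramp, not real OCT data).
raw = np.arange(512 * 512, dtype=np.uint8).reshape(512, 512)

def uchar_to_tensor(img_u8):
    """Convert an 8-bit grayscale image to a normalized float32 NHWC tensor
    (layout and scaling are assumed for illustration)."""
    t = img_u8.astype(np.float32) / 255.0       # scale gray levels to [0, 1]
    return t[np.newaxis, :, :, np.newaxis]      # add batch and channel dims

tensor = uchar_to_tensor(raw)
```

Keeping this conversion cheap matters because it sits on the real-time path between CUDA post-processing and model inference for every displayed frame.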
A workstation with an Intel Xeon E5-2620 CPU (2.4 GHz), 48 GB RAM and the Windows 10 64-bit operating system was used as the host computer for online noise reduction performance evaluation. It contains one NVIDIA GeForce RTX 2080 Ti GPU with 4352 stream processors, a 1.4 GHz processor clock and 11 GB global memory. The customized OCT imaging software was developed combining Qt (v5.6.3) and Microsoft Visual Studio 2015. Deep learning model-based noise reduction was implemented with the Tensorflow C-API (v1.12.0) as the engine, and CUDA (v10.1) was used for GPU-accelerated data processing.

Effect of resizing in data augmentation
During the training process, for data augmentation, 12 regions of interest with random sizes were first cropped from each original 1000×1024 OCT image and then resized to a fixed 256×256 using bilinear interpolation. To show the effect of resizing in data augmentation, we trained all three models both with and without the resizing operation using the L2 loss function. The loss curves in Fig. 3 show that, with resizing, all three models reached lower convergence levels. An exemplary denoising comparison on a transparent gel board sample image is shown in Fig. 4; resizing makes the output images cleaner and contributes to a better denoising effect. This agrees with the fact that resizing increases the spatial variety of the noise in the training data set for the model to learn.
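The crop-and-resize augmentation can be sketched with numpy alone. This is an illustrative reimplementation, not the authors' code: the square crop shape, minimum crop size, and function names are assumptions, and the minimal bilinear resize stands in for whatever interpolation routine the pipeline actually used.

```python
import numpy as np

def bilinear_resize(img, out_h, out_w):
    """Minimal bilinear resize for a 2D grayscale image."""
    H, W = img.shape
    ys = np.linspace(0, H - 1, out_h)
    xs = np.linspace(0, W - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, H - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, W - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    top = img[np.ix_(y0, x0)] * (1 - wx) + img[np.ix_(y0, x1)] * wx
    bot = img[np.ix_(y1, x0)] * (1 - wx) + img[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy

def random_crop_resize(img, rng, out=256, min_size=256):
    """Crop a random square region of random size, then resize to out x out
    (square crops and min_size are illustrative assumptions)."""
    H, W = img.shape
    s = rng.integers(min_size, min(H, W) + 1)
    y = rng.integers(0, H - s + 1)
    x = rng.integers(0, W - s + 1)
    return bilinear_resize(img[y:y + s, x:x + s], out, out)

rng = np.random.default_rng(0)
frame = rng.normal(size=(1024, 1000))   # placeholder for a 1000x1024-pixel B-scan
patch = random_crop_resize(frame, rng)
```

Because the crop size is random, the interpolation rescales the speckle grain differently for every patch, which is the "noise spatial variety" effect discussed above.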

Comparison between networks
To compare the noise reduction performance of Unet, SRResNet, AC-SRResNet and the traditional BM3D method, we tested five different samples: human finger nail, tooth sample, onion, human hand palm and transparent gel board. Figure 5 shows the original noisy image and images processed with BM3D and the three models trained with the L2 loss function; Fig. 6 shows the images processed with the three models trained with the L1 loss function. Visual inspection of Fig. 5 shows that all three L2-loss models output noise-reduced images with better preserved details than BM3D, as the images denoised by BM3D show no grainy speckle pattern. Visual inspection of Fig. 6 shows that SRResNet and AC-SRResNet with the L1 loss achieve similar performance, while the Unet model generates images with a noticeable nonuniform background compared to Fig. 5. Figure 7 shows an exemplary hand palm OCT image denoising comparison between all methods; the red box region marked on the original image of each denoised result is enlarged and shown on the right. From Fig. 7 we can see that, while reducing noise, the Unet model introduced small uniform patches into the image. This phenomenon is not noticeable with the other two models. The reason might be that the capability of Unet, as a lightweight network, is inferior to that of the more sophisticated SRResNet and AC-SRResNet. To evaluate the performance of all methods quantitatively, we calculated the SNR, CNR, EPI and computation time (CT) shown in Table 1. Three regions of object and edge details were chosen on each image for analysis. We selected 20 noisy images from each of the five test sets, for a total of 100 denoised images, and the average SNR, CNR, EPI and CT were calculated over these 100 images. Overall, the noise in the original images has been reduced. BM3D demonstrates the highest SNR of 44.31 dB and CNR of 43.49 dB, but with the lowest EPI of 0.54 and the longest processing time of 21.96 s.
Compared to the BM3D method, the deep learning-based noise reduction methods showed advantages in both detail preservation and computation time.
For the Unet model, the L2 loss shows an average 6.01 dB SNR and 5.04 dB CNR advantage over the L1 loss, which agrees with the visual inspection; the differences in EPI and CT are trivial. For SRResNet, the L2 loss shows an average 1.94 dB SNR and 1.39 dB CNR advantage, while the L1 loss shows a 0.04 EPI improvement; the CT difference is small. For AC-SRResNet, the L1 loss shows an average 2.06 dB SNR, 1.16 dB CNR and 0.06 EPI improvement over the L2 loss; the CT difference is again small. On average, the L2 loss performs better than the L1 loss at removing noise, while the L1 loss preserves edges better.
Among the three models, SRResNet and AC-SRResNet perform better than Unet in both SNR and CNR, while Unet holds a consistently high EPI and the fastest processing time due to its lightweight structure. Compared to SRResNet, the introduction of asymmetric convolution gives AC-SRResNet better edge preservation capability (Table 1).

Generalization ability test
To evaluate the generalization ability to images from different OCT systems, we tested the public OCT2017 dataset [34]. We selected noisy images from the choroidal neovascularization (CNV), diabetic macular edema (DME), drusen and normal subsets, respectively. An exemplary denoising comparison of the retinal fovea region is shown in Fig. 8. Visual comparison shows that the OCT2017 images and our system images differ in contrast. Visual inspection finds noticeable cloudy artifacts for the deep learning models trained with the L2 loss at the weak boundary indicated by the arrows on the image; Unet creates the most serious artifacts, while AC-SRResNet shows a barely noticeable one. In contrast, all three models trained with the L1 loss output no such artifacts. The reason might be that the L2 loss seeks the average recovery while the L1 loss seeks the median recovery in principle; because the image contrast differs between our system and the OCT2017 dataset, the noise in OCT2017 is not zero-mean distributed and is therefore less suited to the L2 loss. For this reason, only models trained with the L1 loss were tested and compared quantitatively with BM3D on the OCT2017 dataset; they show better generalization ability than the models trained with the L2 loss. The noise reduction results are shown in Fig. 9; all of the images have been clearly denoised. Parameter comparison results are shown in Table 2. BM3D shows the lowest SNR and CNR improvement here. Among the deep learning models, AC-SRResNet achieves the highest SNR of 41.80 dB and CNR of 44.64 dB at the cost of the longest CT of 0.98 s, while Unet achieves the lowest SNR of 39.01 dB and CNR of 41.06 dB with the advantage of the shortest CT of 0.2 s. SRResNet achieves a moderate SNR of 39.76 dB and CNR of 42.72 dB. For the EPI values, the Unet model tends to achieve a higher EPI than the other two models.
The reason might be that the denoising effect of Unet is the weakest, since an original image with no noise removed would have an EPI of 1. AC-SRResNet again shows a consistent detail preservation advantage over SRResNet. Based on the analysis of our system images and the public dataset, the deep learning models trained on the Noise2Noise principle all achieved good denoising performance. Although the speckle pattern remains in the image as expected, a certain smoothing effect can be observed, which will reduce the speckle contrast. The impact of this on future applications such as speckle-based OCT angiography (OCTA) analysis of images in the logarithmic domain requires further study. Please note that the method proposed in this manuscript denoises OCT images in the logarithmic domain; intensity- or amplitude-based OCTA analysis might not be able to use this method directly.

Real-time online imaging test results
We tested the online performance of the Unet, SRResNet and AC-SRResNet noise reduction models trained with the L2 loss on whole B-mode images of 512×512 pixels, since the L2 loss offers better performance for our system images. The results are shown in Table 3. When tested offline, without accounting for image data transfer in memory and format conversion, the Unet, SRResNet and AC-SRResNet noise reduction models reach 75 fps, 21 fps and 19 fps, respectively. On the real-time online imaging system, they achieve maximal denoising rates of 64 fps, 19 fps and 17 fps, respectively, without causing the program interface to freeze. Screen captures of real-time finger nail image noise reduction are shown in Fig. 10, with zoomed areas of interest on the right. Visualization 1, Visualization 2, and Visualization 3 are videos showing the image denoising results of Unet, SRResNet and AC-SRResNet for in vivo human hand palm images, respectively. Processing based on the Unet model is the fastest due to the simplicity and relatively shallow depth of its structure. From the real-time test, we can see a tradeoff between image quality improvement and processing time: as more layers are added to the network, feature extraction capability increases and better noise discrimination and reduction can be achieved, but the additional time consumption becomes obvious. Nevertheless, for application scenarios where time consumption is not a crucial factor, a sophisticated network model can be adopted. Meanwhile, there is still room to reduce the processing time. Currently, only one GPU is configured for both OCT signal processing and deep learning image denoising. We tested installing a second GPU and tried to implement a multithread-controlled data parallelism technique to further reduce the processing time.
However, the current Tensorflow C-API library does not support independent configuration of each GPU for noise reduction. Once the library support is ready, we believe further processing time reduction is achievable.

Conclusions
We proposed a deep learning-based noise reduction method for OCT images that requires no noise-free ground truth images as labels. A comparison study with the conventional noise reduction method BM3D and among different network structures, including Unet, SRResNet and our modified AC-SRResNet trained with the L2 and L1 losses respectively, was performed. Further incorporation into an online OCT imaging system for real-time noise reduction was demonstrated for 512×512 images at 64 fps for Unet, 19 fps for SRResNet and 17 fps for AC-SRResNet. We believe the proposed method will benefit future responsive deep learning-based OCT signal processing and analysis platforms.