Real-time OCT image denoising using a self-fusion neural network

Abstract: Optical coherence tomography (OCT) has become the gold standard for ophthalmic diagnostic imaging. However, clinical OCT image-quality is highly variable, and limited visualization can introduce errors in the quantitative analysis of anatomic and pathologic features-of-interest. Frame-averaging is a standard method for improving image-quality; however, frame-averaging in the presence of bulk-motion can degrade lateral resolution and prolong total acquisition time. We recently introduced a method called self-fusion, which reduces speckle noise and enhances OCT signal-to-noise ratio (SNR) by using similarity between adjacent frames and is more robust to motion-artifacts than frame-averaging. However, since self-fusion is based on deformable registration, it is computationally expensive. In this study, a convolutional neural network was implemented to offset the computational overhead of self-fusion and perform OCT denoising in real-time. The self-fusion network was pretrained to fuse 3 frames to achieve near video-rate processing. Our results showed a clear gain in peak SNR in the self-fused images over both the raw and frame-averaged OCT B-scans. This approach delivers a fast and robust OCT denoising alternative to frame-averaging without the need for repeated image acquisition. Real-time self-fusion image enhancement will enable improved localization of the OCT field-of-view relative to features-of-interest and improved sensitivity for anatomic features of disease.


Introduction
Optical coherence tomography (OCT) has become ubiquitous in ophthalmic diagnostic imaging over the last three decades [1,2]. However, clinical OCT image-quality is highly variable and often degraded by inherent speckle noise [3,4], bulk-motion artifacts [5][6][7], and ocular opacities/pathologies [8,9]. Poor image-quality can limit visualization and introduce errors in quantitative analysis of anatomic and pathologic features-of-interest. Averaging of multiple repeated frames acquired at the same or neighboring locations is often used to increase signal-to-noise ratio (SNR) [10][11][12]. However, frame-averaging in the presence of bulk-motion can degrade lateral resolution and prolong total acquisition time, which can make clinical imaging challenging or impossible in certain patient populations.
Many computational techniques for improving ophthalmic OCT image-quality have been previously described, including compressed sensing, filtering, and model-based methods [13]. One compressed sensing approach creates a sparse representation dictionary of high-SNR images that is then applied to denoise neighboring low-SNR B-scans [14]. However, this method requires a non-uniform scan pattern to slowly capture high-SNR B-scans to create a sparse representation dictionary, which limits its robustness in clinical applications. A different dictionary-based approach obviates the need for high-SNR B-scans and frame-averaging by utilizing K-SVD dictionary learning and curvelet transform for denoising OCT [15]. Various well-known image denoising filters, such as Block Matching 3-D and Enhanced Sigma Filters combined with the Wavelet Multiframe algorithm, have also been evaluated for OCT denoising [16]. However, these aforementioned methods are computationally expensive, particularly when combined with wavelet-based compounding algorithms. Patch-based approaches, such as the spatially constrained Gaussian mixture model [17] and non-local weighted group low-rank representation [18], have also been proposed for OCT image-enhancement but have similar computational overhead and are, thus, unsuitable for real-time imaging applications.
Deep-learning based denoising has gained popularity in medical imaging [19][20][21] and OCT [22,23] applications. In addition to producing highly accurate results, these methods also overcome the computational burden of traditional methods and enable real-time processing once a model is trained. Convolutional Neural Networks (CNNs) for OCT denoising have included a variety of network architectures such as GANs [11,[24][25][26][27][28], MIFCN [29], DeSpecNet [30], DnCNN [10,31], Noise2Noise [32], GCDS [33], U-Net [34][35][36][37], and intensity fusion [38,39]. The critical barrier to translating deep-learning methods to clinical OCT is the lack of an ideal reference image to be used as the ground-truth. To overcome this limitation, some CNN-based methods use averaged frames as the ground-truth to mimic frame-averaging image-quality. Recently, a ground-truth free CNN method was demonstrated that uses one raw image as the input and a second raw image as the ground-truth [40]. Here, the image-enhancement benefit was sacrificed in favor of faster processing rates. This trade-off highlights the limitations of current-generation deep-learning based OCT denoising methods, which require a large number of repeated input images to compute the ground-truth but only a small number of input images to achieve real-time processing rates, which can result in artifactual blurring (Table 1). Oguz et al. [41] recently demonstrated robust OCT image-enhancement using self-fusion. This method is based on multi-atlas label fusion [42], which exploits the similarity between adjacent B-scans. Self-fusion does not require repeat-frame acquisition, is edge preserving, enhances retinal layers, and significantly reduces speckle noise. The main limitation of self-fusion is its computational complexity due to the required deformable image-registration and similarity computations, which precludes real-time OCT applications.
Our inability to access image-enhanced OCT images in real-time significantly limits the utility of self-fusion for evaluating dynamic retinal changes. Similarly, because real-time image aiming and guidance is performed using noisy raw OCT cross-sections, it is challenging to accurately evaluate image focus, whether the field-of-view sufficiently covers features-of-interest, and image quality after self-fusion, thus reducing the yield of usable clinical datasets. In this study, we overcome the processing limitations of self-fusion by developing a CNN that uses denoised self-fusion images as the ground-truth [41]. This approach combines the robustness of self-fusion denoising and the high processing-speed of neural networks. Here, we demonstrate integration and translation of optimized data acquisition and processing for real-time self-fusion image-enhancement of ophthalmic OCT at ∼22 fps for an image size of 512 × 512 pixels. While potentially more prone to artifacts from implementation of a neural network, our proposed approach enables real-time denoising of raw OCT images, which can be used directly as an indicator of image quality following artifact-free offline self-fusion processing of the acquired data. Similar strategies have been demonstrated in OCT angiography applications to provide previews of volumetric vascular projection maps in real-time [43,44]. This real-time denoising technology can also enhance diagnostic utility in applications that require immediate feedback or intervention, such as during OCT-guided therapeutics or surgery [45,46].

OCT system
All images were acquired with a handheld OCT system previously reported in [47,48] with a 200 kHz, 1060-nm center wavelength swept-source laser (Axsun) optically buffered to 400 kHz. The OCT signal was detected using a 1.6 GHz balanced photodiode (APD481AC, Thorlabs) and discretized with a 12-bit dual-channel 4 GS/s waveform digitizer board (ATS-9373, AlazarTech). The OCT sample-arm beam was scanned using a galvanometer pair and relayed using a 2× demagnifying telescope to a 2 mm diameter spot at the pupil. All human imaging data were acquired under a protocol approved by the Vanderbilt University Institutional Review Board.

Dataset
Volumetric OCT datasets of healthy human retina centered on the fovea and optic nerve head (ONH) were acquired to train and test the self-fusion neural network. OCT optical power incident on the pupil was attenuated (1-2 mW) to simulate different SNR levels. Each volume contained 2500 raw B-scans (500 sets of 5 repeated frames). The repeated frames were averaged, and the resulting 500-frame volume was self-fused with a radius of 3 frames (3 adjacent images before and 3 after the current frame, for a total of 7 images) to achieve high-quality ground-truth images to train the neural network. Supplemental Fig. 1 shows the effect of self-fusion with different radii. Sets of 3 raw and non-averaged B-scans (radius of 1) were used as self-fusion neural network inputs to obtain one denoised image. Figure 1 shows examples of raw and corresponding self-fused OCT B-scans used as ground-truth images.
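The frame grouping described above can be sketched in a few lines of NumPy; the array shapes and function names here are illustrative stand-ins, not the authors' code:

```python
import numpy as np

def average_repeats(volume, n_repeats=5):
    """Average sets of repeated B-scans: (n_sets*n_repeats, H, W) -> (n_sets, H, W)."""
    n_frames, h, w = volume.shape
    assert n_frames % n_repeats == 0
    return volume.reshape(-1, n_repeats, h, w).mean(axis=1)

def neighbor_indices(center, n_frames, radius=3):
    """Frame indices within `radius` of `center` (7 frames for radius 3),
    clipped at the volume edges."""
    lo = max(0, center - radius)
    hi = min(n_frames, center + radius + 1)
    return list(range(lo, hi))

# A 2500-frame raw volume becomes 500 averaged frames; each ground-truth
# frame is then self-fused from its 7-frame neighborhood (radius 3).
raw = np.zeros((2500, 64, 64), dtype=np.float32)  # toy stand-in for a volume
avg = average_repeats(raw)             # shape (500, 64, 64)
idx = neighbor_indices(250, len(avg))  # indices 247..253
```

The same `neighbor_indices` call with `radius=1` yields the 3-frame input windows used by the network.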
Additional OCT images from external datasets were added as test images. The first set of images was taken from the dataset used to test the Sparsity Based Simultaneous Denoising and Interpolation (SBSDI) method [49]. Two sets of images acquired with different OCT systems (Cirrus: Zeiss Meditec; T-1000 and T-2000: Topcon) were taken from the Retinal OCT Fluid Detection and Segmentation Benchmark and Challenge (RETOUCH) dataset [50].

Network architecture and training
The network was designed and implemented in PyTorch based on the multi-scale U-Net architecture proposed by Devalla et al. [35]. The model was trained on 9 ONH volumes with various SNR and validated on 3 fovea volumes to avoid information leakage (Fig. 2(A)). Additionally, 3 ONH and 3 fovea volumes were used as test data. The self-fusion neural network was trained on an RTX 2080 Ti 11GB GPU (NVIDIA) until the loss function plateaued (30 epochs). Parameters in the network were optimized using the Adam optimization algorithm with a starting learning rate of 1e-3 and a decay factor of 0.8 for every epoch. Batches of 3 adjacent registered OCT B-scans (radius of 1) were used as inputs to train the network. Here, the central frame was denoised based on information from the neighboring slices. The number of requisite input B-scans is kept low to achieve video-rate self-fusion processing. Similarly, while deformable image-registration is ideal for denoising and, in this case, used to generate the ground-truth self-fused images (Fig. 1), a discrete Fourier transform (DFT) based rigid image-registration [51] was adopted as the motion correction strategy to minimize computational overhead. The computational cost of the original self-fusion method using rigid DFT registration or deformable registration (Symmetric Normalization, SyN) in ANTsPy [52] is compared in Table 2. The computer used for this test had an 11th Gen Intel Core i7-11700 CPU @ 2.50 GHz × 16.
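A minimal PyTorch sketch of the training configuration described above (Adam, starting learning rate of 1e-3, decay factor of 0.8 per epoch, 3-frame input). The tiny convolutional stack is only a stand-in for the multi-scale U-Net, and the MSE loss is an assumption; the loss function is not specified in the text:

```python
import torch
import torch.nn as nn

# Stand-in for the multi-scale U-Net of Devalla et al.: 3 registered
# adjacent B-scans in, one denoised central frame out.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1),
)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.8)  # 0.8 decay per epoch
loss_fn = nn.MSELoss()  # assumption: the paper does not state its loss function

for epoch in range(2):  # the actual model trained for 30 epochs
    x = torch.rand(4, 3, 64, 64)   # batch of 3-frame inputs (toy data)
    y = torch.rand(4, 1, 64, 64)   # self-fused ground-truth frames
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()               # lr is now 1e-3 * 0.8**2
```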

Real-time implementation
Custom C++ software was used to acquire and process OCT images. TorchScript was used to create a serializable version of the self-fusion neural network model that could be used in C++ via LibTorch (the C++ analog of PyTorch). The model was loaded in a LibTorch-based module and executed to denoise and display self-fusion denoised OCT images (Fig. 2(B)). The OCT acquisition and processing software consists of a main thread that controls the graphical user interface and image visualization, and two sub-threads running asynchronously to control data acquisition (DAQ) and processing. The DAQ module acquired 16-bit integer raw OCT data with 2560 pixels/A-line and 512 A-lines/B-scan. Sets of three images were acquired, and copies of these images were loaded into 32-bit float LibTorch-GPU tensors. The LibTorch-based GPU-accelerated OCT processing pipeline included: background subtraction, Hanning spectral windowing, dispersion-compensation, Fourier transform, logarithmic compression, and image cropping (Fig. 3). The resulting images were then intensity-normalized, motion-corrected using fast DFT registration, denoised using the pretrained self-fusion neural network, and finally contrast-adjusted using the 1st and 99th percentiles of the image data as lower and upper intensity limits, respectively.
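The processing chain above can be illustrated with a NumPy sketch; the real pipeline runs on GPU via LibTorch, and the function names, array layout, and the simple phase-correlation registration below are illustrative assumptions rather than the authors' implementation:

```python
import numpy as np

def process_bscan(raw, disp_phase=None):
    """Raw spectra (samples x A-lines) -> log-scaled B-scan: background
    subtraction, Hanning window, optional dispersion phase, FFT, log compression."""
    spec = raw - raw.mean(axis=1, keepdims=True)        # background subtraction
    spec = spec * np.hanning(spec.shape[0])[:, None]    # spectral windowing
    spec = spec.astype(np.complex128)
    if disp_phase is not None:                          # dispersion compensation
        spec *= np.exp(1j * disp_phase)[:, None]
    img = np.abs(np.fft.fft(spec, axis=0))[: spec.shape[0] // 2]  # positive depths only
    return 20 * np.log10(img + 1e-12)                   # logarithmic compression

def dft_shift(fixed, moving):
    """Phase correlation: the (row, col) shift to apply to `moving` to align
    it with `fixed` (DFT-based rigid registration)."""
    cp = np.fft.fft2(fixed) * np.conj(np.fft.fft2(moving))
    cp /= np.abs(cp) + 1e-12
    corr = np.abs(np.fft.ifft2(cp))
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    return tuple(int(p) if p <= n // 2 else int(p) - n for p, n in zip(peak, corr.shape))

def contrast_stretch(img, lo=1, hi=99):
    """Contrast adjustment with the 1st/99th percentiles as intensity limits."""
    a, b = np.percentile(img, [lo, hi])
    return np.clip((img - a) / (b - a + 1e-12), 0.0, 1.0)
```

In the real-time system, the denoising network would be applied between the registration and contrast-adjustment steps.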

Quantitative evaluation
The two most commonly used quantitative metrics for assessment of noise reduction were adopted to evaluate the performance of the self-fusion neural network. Peak signal-to-noise ratio (PSNR):

PSNR = 20 log₁₀(max(I_f)/σ_b),

where max(I_f) denotes the maximum foreground intensity and σ_b denotes the standard deviation of the background. Contrast-to-noise ratio (CNR):

CNR = |µ_f − µ_b| / √(σ_f² + σ_b²),

where µ_f and µ_b are the means of the foreground and background, and σ_f and σ_b are the standard deviations of the foreground and background, respectively.

Figure 4 depicts a representative set of fovea images processed with the self-fusion neural network. All images were contrast-adjusted using the same percentiles. CNR and PSNR were calculated for the raw, 3-frame-averaged, and 3-frame self-fusion neural network denoised OCT B-scans. The quantitative comparison of frame-averaging and self-fusion with respect to the original raw OCT image is summarized in Fig. 5. Experimental results showed CNR improved by ∼50% for frame-averaging and ∼100% for self-fusion over raw OCT B-scans. Likewise, self-fusion outperformed frame-averaging on PSNR, with respective improvements of ∼90% and ∼20% over raw B-scans. The contrast of retinal layers and vessels was improved, which facilitates the identification of anatomical and potentially pathological features.

Self-fusion neural network processing time was also compared to off-line processing (Table 3). The average processing time and frame-rate for OCT processing, DFT registration, and the self-fusion neural network were quantified using 100 frames from an OCT testing dataset. The CPU and GPU used for benchmarking were a Xeon E5-2630 v4 2.2 GHz (Intel) and a GeForce RTX 2080 Ti (NVIDIA), respectively. The results showed that the self-fusion neural network can achieve near video-rate performance at ∼22 fps (Visualization 1 in supplemental material). Table 3. Average processing time and frame-rate for OCT processing and self-fusion neural network denoising on CPU and GPU.
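The two metrics can be computed from foreground and background masks as in this NumPy sketch; the exact foreground/background region definitions are not specified in the text, so the masks here are assumptions:

```python
import numpy as np

def psnr(img, fg_mask, bg_mask):
    """Peak SNR: peak foreground intensity over background noise std, in dB."""
    return 20 * np.log10(img[fg_mask].max() / img[bg_mask].std())

def cnr(img, fg_mask, bg_mask):
    """Contrast-to-noise ratio between foreground and background regions."""
    mu_f, mu_b = img[fg_mask].mean(), img[bg_mask].mean()
    s_f, s_b = img[fg_mask].std(), img[bg_mask].std()
    return abs(mu_f - mu_b) / np.sqrt(s_f ** 2 + s_b ** 2)
```

A typical choice of masks would place the foreground over the retinal layers and the background over a signal-free region above the retina.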
The total processing time was calculated as the sum of the three most critical processing blocks (OCT processing, DFT registration, and self-fusion NN) plus trivial memory-to-memory data transfer time. For real-time processing, DFT registration was selected over deformable registration to reduce computation time. The results reported in Tables 2 and 3 demonstrate the image processing advantage of the self-fusion neural network over the original self-fusion method. As expected, the image quality of self-fused images increases with the use of deformable registration at the expense of increased processing time.

The self-fused images from the external datasets are illustrated in Fig. 6. A set of three raw OCT images from each dataset was registered and self-fused with the neural network. Improvements in image quality and reduction in speckle noise in the SBSDI images self-fused with the neural network show enriched visualization of blood vessels and retinal layers in the fovea. The images taken with the Cirrus system show the retina with improved visualization of macular edema. The images acquired with the Topcon system show abnormal RPE, where drusenoid deposits are clearly seen in the denoised image as a smooth dome-shaped elevation. In all cases, the self-fusion neural network outperformed averaging in terms of CNR and PSNR, enhancing retinal features. Therefore, the images self-fused with the neural network provided enhanced visualization of diagnostically relevant pathological features in images with varied image-quality and limited visualization.

Fig. 6. Raw, average, and self-fusion neural network denoised OCT images of external datasets. Images from the SBSDI dataset (first row) and RETOUCH dataset (Cirrus: middle row; Topcon: bottom row).

Discussion
Noise and poor image-quality can limit accurate identification and quantitation of pathological features on ophthalmic OCT. While robust denoising methods are well-established, these are limited to off-line implementations due to long processing times and high computational complexity. Real-time OCT image-enhancement is critical for clinical imaging to ensure patient data is of sufficient quality to perform structural and functional diagnostics in post-processing. These benefits are even more critical for OCT-guided applications, such as in ophthalmic surgery where image-quality is degraded by ocular opacities [53][54][55].
Deep-learning methods have shown potential for real-time image denoising. However, existing methods have shown a tradeoff between preserving structural details and reducing noise that can result in over-smoothing and loss of resolution. More importantly, deep-learning based methods require robust training data of ocular pathologies to avoid inclusion of unwanted artifacts. In this study, we implemented real-time OCT image-enhancement at near video-rates based on self-fusion. Self-fusion is more robust to motion-artifacts as compared to frame-averaging and overcomes the need for extensive training by using similarity between adjacent OCT B-scans to improve image-quality. These benefits were confirmed experimentally with our video-rate self-fusion implementation, and we show significant advantages in CNR and PSNR over frame-averaging.
We demonstrated the ability of the self-fusion neural network to denoise OCT images not only from our research grade systems but also from external datasets acquired with different commercial OCT technology. Although the neural network was trained on images from healthy human retina, the denoised external OCT images present relevant pathological features that were enhanced with the neural network such as vascularization, layer detachment, macular holes, and drusenoid deposits.
While the proposed self-fusion neural network outputs suffer from a slight image-smoothing effect produced by convolution and rigid registration when compared to self-fusion, better generalization through more robust network architectures, data augmentation, a larger training database, and more images as input channels for the neural network may help preserve features [35]. In addition, the use of more powerful GPUs will enable increasing the number of input images, which can reduce smoothing artifacts without sacrificing processing speed. The proposed method may also be directly applied to OCT variants such as OCT angiography, Doppler OCT, OCT elastography, and polarization-sensitive OCT to improve image-quality and diagnostic utility.
Our results showed a significant improvement in CNR and PSNR in the self-fused B-scans over the frame-averaged and raw B-scans, where reduced speckle noise and improved contrast benefit identification of anatomical features such as retinal layers, vessels, and potential pathologic features. The proposed approach delivers a fast and robust OCT denoising alternative to frame-averaging without the need for multiple repeated image acquisitions at the same location. While we expect a few image artifacts from our neural-network implementation, conventional offline self-fusion may be directly applied to corresponding datasets that require quantitative analyses or precision diagnostic feature extraction in post-processing. Real-time self-fusion image enhancement will enable improved localization of the OCT field-of-view relative to features-of-interest and improved sensitivity for anatomic features.

Disclosures. The authors declare no conflicts of interest.