Resolution enhancement and realistic speckle recovery with generative adversarial modeling of micro-optical coherence tomography

A resolution enhancement technique for optical coherence tomography (OCT), based on Generative Adversarial Networks (GANs), was developed and investigated. GANs have been previously used for resolution enhancement of photography and optical microscopy images. We have adapted and improved this technique for OCT image generation. Conditional GANs (cGANs) were trained on a novel set of ultrahigh resolution spectral domain OCT volumes, termed micro-OCT, as the high-resolution ground truth (∼1 µm isotropic resolution). The ground truth was paired with a low-resolution image obtained by synthetically degrading the resolution 4x in either the axial or lateral axis (1-D) or in both axes (2-D). Cross-sectional image (B-scan) volumes obtained from in vivo imaging of human labial (lip) tissue and mouse skin were used in separate feasibility experiments. Accuracy of resolution enhancement compared to ground truth was quantified with human perceptual accuracy tests performed by an OCT expert. The GAN loss in the optimization objective and noise injection in both the generator and discriminator models were found to be important for achieving realistic speckle appearance in the generated OCT images. Qualitative examples applying the models to image data from outside the training data distribution, namely human retina and mouse bladder, were also demonstrated, suggesting potential for cross-domain transferability. This preliminary study suggests that deep learning generative models trained on OCT images from high-performance prototype systems may have potential for enhancing lower resolution data from mainstream/commercial systems, thereby bringing cutting-edge technology to the masses at low cost.


Introduction
Optical coherence tomography (OCT) is a 3-dimensional optical imaging technique, and has become part of the standard of care in ophthalmology [1] while growing in importance in other clinical specialties such as gastroenterology [2]. Axial resolution of OCT is governed by the light source bandwidth, while lateral (transverse) resolution is governed by the numerical aperture (NA) of the illumination beam [3]. Hardware efforts to improve axial and lateral resolution can be complex, requiring high performance lasers and imaging objectives, and dispersion matching in the reference and sample paths. Computational techniques have been employed to overcome these constraints. For axial resolution, dispersion mismatch can be corrected by compensation algorithms to restore resolution to the ideal limit set by the light source; recent studies have proposed methods to surpass that limit [4]. For lateral resolution, traditional deconvolution techniques such as the Richardson-Lucy algorithm have been suggested [5], while physics-based algorithms such as interferometric synthetic aperture microscopy have also been successful [6].
Deep learning has found most success in image classification and feature detection tasks [7], and has also had a growing impact in computational imaging and inverse problems [8] such as resolution enhancement (known as 'super-resolution' in the computer vision literature), notably in optical microscopy [9] where deep learning has been used to improve image quality [10]. Generative adversarial networks (GANs) [11], an emerging branch of deep learning, have shown promise in a wide range of imaging applications. GANs use two powerful neural networks competing with each other to greatly enhance the quality and realism of machine-generated images, potentially performing better than a single neural network alone or blind techniques without data priors. Conditional GANs (cGANs) are a flavor of GANs that learn a mapping between two domains by training on image pairs, where a 'conditional' image in one domain is co-registered with a ground truth image from another domain [12]. These techniques have been investigated in photography and microscopy [13], as well as in OCT studies [14,15], which focused on denoising and deblurring of commercial ophthalmic OCT images, rather than the learning of higher optical resolution from prototype OCT systems.
In this work we explore the hypothesis that cGANs can be used to enhance the axial and lateral resolution of OCT images, trained on an ultrahigh resolution OCT ground truth. Using images obtained by micro-OCT [16,17] with ∼1 µm resolution, axial and lateral resolutions were synthetically degraded by windowing/averaging the interference spectra, producing an intrinsically co-registered set of paired low-high resolution data for training. Injection of noise in the cGAN architecture was found to substantially improve the quality of image generation. Comparisons were made between our approach and several previously reported techniques: classical blind deconvolution without deep learning (Richardson-Lucy deconvolution), a state-of-the-art non-adversarial deep learning approach, and a vanilla cGAN with no noise injection. Models were separately trained on two datasets: mouse skin and human lip. Three use cases were investigated: the conversion of 1-dimensional low resolution (axial or lateral) to high resolution, and the conversion of 2-dimensional (axial+lateral) low resolution to high resolution. The 2-D case was further investigated for the realism of the speckle reconstruction, using a perceptual quality test performed by an OCT expert, in which our GAN approach was found to perform better than previous techniques. We also report training details and hyperparameter heuristics that are specific to OCT image generation. Lastly, we show qualitative examples of our models performing enhancement on OCT data from outside of the training data distribution, namely human retina and mouse bladder, suggesting potential for cross-domain transferability.

micro-OCT image data and pre-processing
Images were obtained using a prototype micro-OCT system previously reported [18], with an axial scan rate of 60 kHz, axial resolution of 1.3 µm (in tissue) and lateral resolution of 1.8 µm. Two datasets were investigated: 10 volumes of mouse skin images, and 8 volumes of human labial (lip) mucosa images, acquired in vivo by a handheld probe and reported in an earlier publication [18]. Each volume had dimensions of ∼800 × 1000 × 500 pixels (∼500 B-scans per volume). The pixel size was 0.4 µm (axial) and 0.8 µm (lateral). To generate realistic low axial resolution images, a tight Gaussian window with full width at half maximum (FWHM) set to 25% of the source bandwidth was applied to the raw k-space interference fringe data, degrading the axial resolution to ∼5 µm while preserving the depth dimension. To generate low lateral resolution images, the fringes were moving-averaged over 6 A-scan lines, corresponding to ∼5 µm in the lateral direction. For 2-D (axial and lateral) low resolution, the fringes were windowed and then moving-averaged. The low resolution images were thus intrinsically co-registered with the high resolution images, with the same pixel dimensions. The images were then cropped to non-overlapping 256 × 256 image patches for model training, with deep low-signal regions discarded. The skin volumes were split into 7 volumes for training (28,728 training image patches) and 3 volumes for validation (12,312 validation image patches). The lip mucosa volumes were split into 5 volumes for training (30,780 training image patches) and 3 volumes for validation (18,468 validation image patches). Different models were trained on two versions of the data: single-frame data and 3-frame moving-averaged data. The 3-frame averaging served as a simple denoising technique that improved the perceptual quality of the images, and is standard practice in OCT processing when some speckle reduction is preferred.
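For concreteness, the sketch below illustrates in Python/NumPy how the synthetic degradation described above could be implemented: a Gaussian spectral window with FWHM equal to 25% of the source bandwidth for axial degradation, and a 6-line moving average for lateral degradation. The array layout and function names are illustrative assumptions, not the code used in this study.

```python
import numpy as np

def degrade_fringes(fringes, axial=True, lateral=True, fwhm_frac=0.25, n_avg=6):
    """Simulate low axial/lateral resolution from background-subtracted k-space fringes.

    fringes: 2-D array of shape (n_k, n_alines), assumed linear in wavenumber.
    """
    out = fringes.astype(np.float64).copy()
    n_k = out.shape[0]
    if axial:
        # Gaussian spectral window, FWHM = 25% of the detected bandwidth (~4x worse axial resolution)
        sigma = (fwhm_frac * n_k) / (2.0 * np.sqrt(2.0 * np.log(2.0)))
        k = np.arange(n_k) - n_k / 2.0
        out *= np.exp(-0.5 * (k / sigma) ** 2)[:, None]
    if lateral:
        # Moving average over 6 adjacent A-lines (~5 um laterally, ~4x worse lateral resolution)
        kernel = np.ones(n_avg) / n_avg
        out = np.apply_along_axis(lambda row: np.convolve(row, kernel, mode="same"), 1, out)
    return out

def to_log_bscan(fringes):
    """Standard reconstruction: FFT along depth (k), log-compressed magnitude."""
    depth = np.abs(np.fft.fft(fringes, axis=0))[: fringes.shape[0] // 2]
    return 20.0 * np.log10(depth + 1e-12)
```

Because the low-resolution B-scan is reconstructed from the same fringes as the ground truth, the resulting image pair is intrinsically co-registered, as noted above.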

cGAN architecture and training
A cGAN architecture was used for the image enhancement deep learning model (Fig. 1). In this architecture, two neural networks learn from each other, thereby improving the quality of the outputs. A 'generator' network learns from paired training data to produce an enhanced image from a 'conditional' input image relative to a ground truth, while a 'discriminator' network learns to discriminate whether the generator's output is a generated 'fake' or a genuine ground truth, and returns feedback to the generator. As training of the generator and discriminator is performed alternately, the two models compete until, in theory, the generated images become indistinguishable from the ground truth, although in practice the generated quality does not necessarily converge to an optimum. This previously reported cGAN design is widely known as 'pix2pix' [12] and has several open-source skeleton implementations generously made available by the machine learning community [19,20]. We made specific modifications as described below.
The generator used a 'U-Net' architecture [21], in which a series of downsampling and upsampling convolutional paths with skip connections capture patterns at various levels of abstraction. A U-Net is traditionally used for image segmentation, but has also proven effective for the generation task. Recent deep learning papers have suggested that a deeper generator comprising multiple residual network (ResNet) blocks might have superior performance [22], but we did not observe significant differences on our training data. The generated image was fed to the discriminator, which comprised two 'patchGAN'-style classifiers operating at two image scales [12,22]. The receptive field of each pixel in the discriminator output was designed to be small (15 and 30 pixels wide) relative to the input, such that finer details at the level of speckle could be evaluated by the discriminator. The GAN objective was regularized by an L1 (pixel-wise mean absolute difference) loss as follows: $L_{GAN} + \lambda L_{1}$, where $\lambda$ was a hyperparameter set to 10. Larger values of $\lambda$ up to 100 have been suggested in prior studies using photographic data [12], but we found these to be prone to poor speckle generation and blurry images (Fig. 2). We also experimented with an additional Difference of Structural Similarity (DSSIM) loss term as suggested in the literature [13,23], but this showed little improvement for OCT data and also produced blurry images; we have observed (Fig. 2) that SSIM may be a poor training objective and evaluation metric for OCT generation. Other training heuristics previously recommended for GANs, such as normalization of inputs between -1 and 1, soft and noisy labels, and training the generator for 2-3 iterations for every discriminator iteration, were used [24]. Models were trained for 30 epochs. To further improve the quality of the images, Gaussian noise with standard deviation 0.1 was injected at every level of the generative upsampling, as well as at the input to the discriminator, as suggested by prior GAN reports [25][26][27]. The same model architecture and hyperparameters were used for enhancing both human lip and mouse skin data. Even though model training was performed on image patches, the fully convolutional nature of the generator model (with no fully connected layers) enabled the use of image inputs of any size, so the full-sized original images could be used in the generator at prediction time.
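As a minimal PyTorch-style sketch (with illustrative module and function names, not the authors' implementation), the two OCT-specific choices above could look as follows: Gaussian noise injection (standard deviation 0.1) at each decoder level of the U-Net generator, and the combined generator objective $L_{GAN} + \lambda L_{1}$ with $\lambda = 10$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyUpBlock(nn.Module):
    """One U-Net decoder level: upsample, concatenate the skip connection, inject Gaussian noise."""
    def __init__(self, in_ch, skip_ch, out_ch, noise_std=0.1):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1)
        self.conv = nn.Conv2d(out_ch + skip_ch, out_ch, kernel_size=3, padding=1)
        self.noise_std = noise_std

    def forward(self, x, skip):
        x = F.relu(self.up(x))
        x = torch.cat([x, skip], dim=1)
        if self.training:
            # Noise injection at this upsampling scale (std 0.1)
            x = x + self.noise_std * torch.randn_like(x)
        return F.relu(self.conv(x))

def generator_loss(d_fake_logits, fake, target, lam=10.0):
    """Adversarial loss plus lambda-weighted L1 regularization, with lambda = 10 as used here."""
    adv = F.binary_cross_entropy_with_logits(d_fake_logits, torch.ones_like(d_fake_logits))
    return adv + lam * F.l1_loss(fake, target)
```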

Perceptual accuracy test by human OCT expert
In order to evaluate the quality and realism of the computational reconstructions, a human reader was asked to evaluate the 2-D enhanced images. In typical GAN studies from the deep learning literature, human readers are crowdsourced from online platforms such as Amazon Mechanical Turk to evaluate the quality of generated photographs or artwork, but this is not feasible for specialized imaging data such as OCT. For our study, the reader was an OCT expert (coauthor X.L.) who was involved in planning the study and preparing the data, but was not involved in the machine learning, was blinded to the models, and had not seen the model-generated results beforehand. Image patches (256 × 256 pixels) were shown to the reader one at a time, and the reader was given two seconds to evaluate each. In a 'paired' test, a generated image was shown side by side with its ground truth image and the reader was asked to identify the real image. In an 'unpaired' test, a single image, either generated or ground truth, was shown and the reader was asked to determine its identity. Two seconds is longer than the one second typically given in GAN perceptual tests [26], to account for the complexity of a typical OCT image. Before commencing the test, 5 practice examples were shown, each followed by the answer. This was followed by a test of 50 questions in sequence, with no answers shown. After each test, a 'confusion score' was computed as the fraction of incorrectly read images out of the 50 total, expressed as a percentage. Higher confusion scores closer to 50% indicate that many images were incorrectly read, suggesting that generated images were realistic and nearly indistinguishable from real high-resolution images (near random chance). Lower confusion scores closer to 0% indicate that most images were correctly read, suggesting that generated images were easily distinguished from real images. The confusion score of the GAN-generated results was compared to separate tests on images produced by a state-of-the-art Unet (non-adversarial training) originally designed for improving the signal-to-noise ratio and image quality of microscopy images [10], as well as by a vanilla conditional GAN [12] without the additional injection of noise.
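The confusion score itself is a simple fraction; a short sketch of its computation for one 50-question test is shown below (variable names are illustrative).

```python
def confusion_score(reader_answers, true_labels):
    """Percentage of images read incorrectly over one 50-question test.

    ~50% indicates chance-level performance (generated images indistinguishable from real);
    0% indicates every image was correctly identified.
    reader_answers, true_labels: equal-length sequences of labels such as 'real' / 'generated'.
    """
    n_wrong = sum(a != t for a, t in zip(reader_answers, true_labels))
    return 100.0 * n_wrong / len(true_labels)
```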

Cross-domain validation on real data
As a preliminary qualitative assessment of the models' performance on real (not simulated) images from a data distribution different from the training data, images of normal (no pathology) human retina [28] and mouse bladder tissue [29] were obtained from freely available datasets that accompanied published papers. The retinal images were acquired on a Zeiss Cirrus ophthalmic spectral domain OCT system, with axial and lateral resolution of 5 µm and 15 µm respectively, and the bladder tissue images were acquired on a Bioptigen Envisu R-class pre-clinical imaging system, with axial and lateral resolution of 0.9 µm (in tissue) and 8.5 µm respectively. It was necessary to resize the input images such that the size of the speckle roughly matched that of the training data. The generator model, like most modern neural networks, used convolution operations, whose filters had been learned from the training data based on the length scales (measured in pixels) of image features including speckle noise. Therefore the images entering the generator needed to have roughly the same speckle length scales as those learned by the generator's convolutional filters (Appendix). For these dissimilar datasets, images were resized to 4x larger pixel dimensions, using bilinear interpolation, before entering the generator. Low-signal regions deep in the images were cropped, and the images were marginally resized to have dimensions that were a multiple of 256, for ease of input to the trained model. Since higher-resolution ground truths were not available, the generated results were assessed qualitatively.
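A minimal sketch of this pre-processing (assuming a single grayscale B-scan stored as a 2-D NumPy array; function names and the use of PIL are illustrative, not the authors' code) is shown below: 4x bilinear upsampling so the apparent speckle size roughly matches the training data, followed by a marginal resize so each dimension is a multiple of 256.

```python
import numpy as np
from PIL import Image

def prepare_cross_domain(bscan, scale=4, multiple=256):
    """Resize an out-of-distribution B-scan before feeding it to the trained generator."""
    arr = np.asarray(bscan, dtype=np.float32)
    h, w = arr.shape
    img = Image.fromarray(arr)
    # 4x bilinear upsampling so the speckle length scale roughly matches the training data
    img = img.resize((w * scale, h * scale), resample=Image.BILINEAR)
    # Marginal resize so each dimension is a multiple of 256, convenient for the generator
    W = max(multiple, round(img.width / multiple) * multiple)
    H = max(multiple, round(img.height / multiple) * multiple)
    img = img.resize((W, H), resample=Image.BILINEAR)
    return np.asarray(img, dtype=np.float32)
```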

Results and discussion
We have developed a deep-learning-based algorithm for resolution enhancement of OCT images, building on previously reported generative adversarial network techniques. Using very high resolution OCT images as the ground truth, a 4x improvement in resolution was demonstrated on images with synthetic resolution degradation. As with typical GAN generation, objective evaluation of the generated outputs was challenging. Given the speckle noise inherent to coherent imaging modalities such as OCT and ultrasound, the model was neither able nor expected to exactly reproduce the noise content of the ground truth images. Therefore, conventional similarity metrics such as Structural Similarity (SSIM) gave low scores. Excessive regularization produced smoothed, speckle-reduced images with poor resemblance to OCT yet higher SSIM scores (Fig. 2). Reduced regularization produced speckle that appeared qualitatively realistic, suggesting that the noise distribution of the speckle was learned, even though the exact details of the generated speckle pattern differed from the ground truth. The generation of realistic yet accurate speckle may be necessary in some specific contexts, and is an interesting possibility for future investigation. Heavy regularization also suggests itself as a means of speckle noise reduction, although this would need more careful validation and was not the objective of this work.
The realism of the generated speckle appeared to be high. However, the detailed content of the generated speckle pattern can differ from the ground truth, particularly in the 2-D case where the inference space is larger (Figure 5), even though larger-scale features are preserved and improved in quality. This difference in the generated noise pattern, especially in the low-signal (nearly black) background of the images where noise can originate from the laser or other system sources, can lead to poor results when standard quantitative pixel-level similarity metrics such as SSIM are used.
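For reference, a pixel-level similarity score such as SSIM between a generated patch and its ground truth can be computed with scikit-image as in the short sketch below (illustrative only; not the exact evaluation code used in this work). As discussed above, such metrics penalize speckle realizations that are realistic but not identical to the ground truth.

```python
import numpy as np
from skimage.metrics import structural_similarity

def ssim_score(generated, ground_truth):
    """SSIM between a generated patch and its co-registered ground truth patch."""
    gen = np.asarray(generated, dtype=np.float64)
    gt = np.asarray(ground_truth, dtype=np.float64)
    return structural_similarity(gen, gt, data_range=gt.max() - gt.min())
```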
Human perceptual accuracy tests were preferred for evaluating the quality of the enhancement (Table 1). Examples from a range of algorithms are presented in Figures 6 and 7. The Richardson-Lucy technique was generally poor (Appendix) and was thus deemed not sufficiently competitive for a human perceptual test. The non-adversarial Unet and the vanilla cGAN (no noise injection) produced images that were easily discriminated by a human OCT expert (0% confusion). The noise-injected cGAN confusion scores were substantially higher. The unpaired test results were lower than the paired results, which was surprising and opposite to typical GAN studies [26], where readers found single images more confusing. This may be because our reader was a subject-matter OCT expert, such that in the absence of a confusing alternate image, he was able to draw on pre-existing specialized knowledge of OCT to distinguish realistic speckle. Results from the 3-frame averaged images showed better quality (higher confusion scores). In practice, the interpretation of OCT images often involves an averaging/denoising process in which speckle noise is intended to be suppressed; training a model on denoised data could allow the model to focus on more important image features rather than speckle noise, which is challenging to reproduce.
Noise injection in the architecture was important for the quality and realism of the reconstruction. Some examples of low-quality images generated by a vanilla cGAN (no noise injection) are shown in Figure 7. The speckle pattern had a repeated grid-like artifact, severely affecting the realism of the images. The images also sometimes showed a noise pattern resembling speckle but repeated across most or all generated images (figure insets). Such a pattern might appear realistic in single images, but was quickly detected by the human reader as a generative artifact when observed over a large number of images from the same generator during a perceptual test.
The adversarial component of the algorithm appeared to be particularly important for OCT generation; the baseline non-adversarial Unet approach has been reported to be successful in microscopy for denoising, increasing signal-to-noise ratio and sharpening, but produced less realistic OCT images than our cGAN approach (examples are shown in Figure 6). This is in agreement with most computational super-resolution studies [30,31], which have favored adversarial learning with GANs.
Our study had some important limitations. A small number of datasets was used in these proof-of-concept experiments, limiting the generalizability of the findings. The low-resolution training data were synthetically created using simple operations of spectral windowing and averaging, which may not sufficiently simulate low-resolution images from real-world conventional OCT systems. Our network architecture mimicked a standard end-to-end learning design used in conventional GANs for artwork/photography, and did not incorporate physical or optical models of OCT image formation. Hybrid physics-inspired learning algorithms, as suggested in recent computational optics studies [32], may potentially improve performance.
As a preliminary step towards application on real data, publicly available retinal OCT images [28] and mouse bladder OCT images [29] were enhanced (Fig. 8) using the model that performed best on the perceptual tests (Table 1), the mouse skin model. Axial resolution was visibly enhanced, but lateral resolution enhancement was marginal. In the future, a more realistic simulation of low lateral resolution in the training data (rather than simply a moving average of the spectra) could further improve performance. In the retina, the resolution of layers, particularly the inner/outer segments and retinal pigment epithelium (highlighted by the inset in Figure 8), was enhanced, which could have relevance to clinical thickness measurements proposed in previous high-resolution OCT studies [33]. The model performance was found to be sensitive to the input image size; using the original image dimensions produced low-quality results. We postulate this to be due to the size and scale of specific image features (e.g. speckle) that are learned from the training data (Appendix). The speckle size of input images should at least roughly match that of the training data; more robust protocols for domain transfer will be developed in future work.
These early proof-of-concept experiments suggest the possibility of packaging the high performance of a prototype imaging system as a low-cost, software-based image enhancement tool that may be used by scientific/clinical peers who lack access to cutting-edge hardware. As long as the neural network model is trained on a data distribution that is identical or very similar to the intended test usage (e.g. imaging of the same organism and organ under similar conditions), the model can be expected to generate high-quality enhancement results. Even cross-domain OCT applications seem feasible in principle, based on our preliminary experiments, although results will be more variable and will require careful validation. Concerns about super-resolution inference leading to 'hallucinatory' artifacts have been reported [34]; these should not dampen enthusiasm for this research direction but rather motivate validation by human readers and comparisons with ground truth images. This concept may also be relevant to high speed swept source OCT systems [35], which typically have lower optical bandwidth and thus worse axial resolution than spectral domain systems. The possibility of having 'the best of both worlds' of OCT systems combining high speed and high resolution is an intriguing avenue of future investigation.

Conclusion
In this proof-of-concept study, the feasibility of axial and lateral resolution enhancement of OCT images using a generative adversarial network was investigated. A high resolution ground truth acquired with micro-OCT, paired with simulated low resolution image inputs, was used to train a neural network to generate resolution-enhanced outputs. Results were evaluated by a human OCT expert for perceptual realism. Preliminary cross-domain experiments were performed on image data from outside of the training data distribution. Future work will involve the acquisition of more realistic training data, such as true low lateral resolution images taken with low numerical aperture optics and low axial resolution images taken with reduced source bandwidth, larger amounts of data with more variety of quality including typical imaging artifacts, and studies of cross-domain transferability and robustness.
Fig. 8. Preliminary experiments with cross-domain application, applying a model trained on mouse skin micro-OCT to human retinal images (courtesy of [28]) and mouse bladder images (courtesy of [29]) from commercial OCT systems.