Employing texture loss to denoise OCT images using generative adversarial networks

OCT is a widely used clinical ophthalmic imaging technique, but the presence of speckle noise can obscure important pathological features and hinder accurate segmentation. This paper presents a novel method for denoising optical coherence tomography (OCT) images using a combination of texture loss and generative adversarial networks (GANs). Previous approaches have integrated deep learning techniques, starting with denoising Convolutional Neural Networks (CNNs) that employed pixel-wise losses. While effective in reducing noise, these methods often introduced a blurring effect in the denoised OCT images. To address this, perceptual losses were introduced, improving denoising performance and overall image quality. Building on these advancements, our research focuses on designing an image reconstruction GAN that generates OCT images with textural similarity to the gold standard, the averaged OCT image. We utilize the PatchGAN discriminator approach as a texture loss to enhance the quality of the reconstructed OCT images. We also compare the performance of UNet and ResNet as generators in the conditional GAN (cGAN) setting, as well as compare PatchGAN with the Wasserstein GAN. Using real clinical foveal-centered OCT retinal scans of children with normal vision, our experiments demonstrate that the combination of PatchGAN and UNet achieves superior performance (PSNR = 32.50) compared to recently proposed methods such as SiameseGAN (PSNR = 31.02). Qualitative experiments involving six masked clinical ophthalmologists also favor the reconstructed OCT images with PatchGAN texture loss. In summary, this paper introduces a novel method for denoising OCT images by incorporating texture loss within a GAN framework. The proposed approach outperforms existing methods and is well-received by clinical experts, offering promising advancements in OCT image reconstruction and facilitating accurate clinical interpretation.


Introduction
Optical coherence tomography (OCT) is a commonly used clinical imaging modality, especially in optometry and ophthalmology, where it is routinely used to capture images of the eye. Speckle noise is inherent to this imaging modality [1], which unfortunately creates challenges for clinical interpretation. The statistical properties of OCT images and the associated speckle noise have been studied in detail in [2,3]. The presence of speckle noise can occlude important features (such as pathology and retinal landmarks) and interfere with accurate segmentation of retinal layers [4][5][6], which is critical for extracting quantitative biomarkers from these images and, as a result, for clinical decision making [7].
A wide range of strategies has been proposed to address the issue of noise in OCT images, encompassing traditional methods as well as deep learning techniques utilizing CNNs and GANs. Traditional methods can be further classified based on whether they employ a single frame or multiple frames for denoising. Examples of single-frame denoising techniques include those proposed by Rogowska et al. [8], Wong et al. [9], Bernardes et al. [10], Puvanathasan et al. [11], Habib et al. [12], Hongwei et al. [13], Kafieh et al. [14], and Chong et al. [15]. Examples of multiple-frame denoising techniques include those proposed by Chitchian et al. [16], Fang et al. [17], and Fang et al. [18]. However, traditional methods often require manual parameter selection and lack adaptability to different noise levels.
In recent years, deep learning methods based on CNNs have emerged as a promising alternative, surpassing the performance of traditional methods. For instance, Shi et al. [19] designed a deep learning network called DeSpectNet for speckle noise reduction in retinal OCT images. They investigated the impact of L1 and L2 losses and found L1 loss to be superior in terms of visual quality and quantitative indices. However, some denoised images produced by DeSpectNet exhibited significant blurriness.
To address the blurring issue, Qiu et al. [20] introduced a CNN-based denoising network that incorporated a perceptually-sensitive loss, the multi-scale structural similarity index (MS-SSIM). Their approach achieved lower levels of blurriness and improved perceptual representation of denoised OCT images, outperforming traditional L1 and L2 losses. The authors also reported enhanced contrast between retinal layers and the background in the denoised images.
Additionally, Mehdizadeh et al. [21] demonstrated the effectiveness of deep feature loss, utilizing the internal activations of pretrained deep neural networks such as VGG, for CNN-based OCT denoising. This method outperformed traditional loss functions such as L1 and L2. However, denoised images obtained using deep feature loss could exhibit unwanted artifacts in the form of mesh-like patterns.
Recent studies have utilized GANs to reconstruct OCT images. Denoising-GAN (DN-GAN) by Chen et al. [22] employed a GAN to reconstruct OCT images. In their approach, they combined the adversarial loss with content loss (L1 and L2) and perceptual loss (deep feature loss). Overall, the GAN method improved OCT denoising compared to traditional and CNN-based methods. However, the results showed that in some rare cases DN-GAN has limitations in preserving structural information in the posterior part of the OCT image (i.e. choroidal tissue).
A recent study by Nilesh et al. [23] introduced the Siamese-GAN network to reconstruct OCT images. In their approach, they constructed a conditional GAN (cGAN) comprising a residual UNet (ResUNet) generator [24] and a Wasserstein GAN (WGAN) [25] discriminator. They complemented their cGAN network with a Siamese twin network [26] to better facilitate generating realistic-looking denoised OCT images. In their experiments, the authors investigated the use of UNet [27] versus ResUNet as generators, and they reported that ResUNet outperformed UNet. Furthermore, they investigated perceptual loss versus mean square error (MSE) in the training of cGAN networks, and reported that MSE is a better loss for generating denoised OCT images. Overall, the Siamese-GAN (ResUNet-WGAN-Siamese) showed improved peak signal-to-noise ratio (PSNR) compared to some other GAN-based networks, while the authors reported inferior texture preserving index (TP) and edge preserving index (EPI) compared to CNN-based OCT denoising approaches.
Isola et al. [28] introduced an image-to-image (I2I) transformation framework, pix2pix, designed based on Markovian model principles. The pix2pix network generated impressive results with realistic-looking textures and detail. The authors reported that participants who were blinded to the origin of the generated images still categorized some of the images generated by their framework as real. Pix2pix consists of a lightweight UNet generator and a Markovian discriminator, called PatchGAN. The authors demonstrated the success of pix2pix across multiple applications of I2I transfer with realistic texture.
Despite all the encouraging results and progress in the field, generating realistic-looking denoised OCT images that preserve the tissue structure and exhibit a texture similar to the averaged OCT image still represents a challenge. Motivated by the work of Isola et al. [28], in this study we investigate the application of OCT image reconstruction employing PatchGAN as a texture/style loss in an I2I transformation setting. We construct cGAN networks with lightweight UNet and ResUNet generators and pair them with a PatchGAN discriminator to gain more insight into the texture synthesis of PatchGAN for the purpose of OCT image reconstruction. We compare the performance of our work with the recently proposed SiameseGAN. Additionally, through our experiments, we systematically compare the performance of WGAN and PatchGAN discriminators. This study also explores the effect of lightweight UNet versus ResUNet generators on the overall performance of a cGAN in reconstructing OCT images. Furthermore, we report on the effect of perceptual loss and MSE on the performance of the networks. We provide quantitative and qualitative analysis of the OCT images reconstructed by the various cGAN networks.
In summary, the main contributions of this work are as follows:
• Explore the effect of PatchGAN as a texture loss for OCT image denoising
The remainder of this paper is structured as follows: Section 2 provides the background information, Section 3 presents details of the proposed methodology, Section 4 provides the experimental results and evaluation, and Sections 5 and 6 are the discussion and conclusion, respectively.

Image-to-Image translation
Image-to-image (I2I) translation [29,30] is a class of learning tasks that transforms images from a source domain to a target domain while preserving the content. Applications include, but are not limited to, day-to-night, label-to-image, photo-to-painting and image colourisation.

GAN as a general solution for I2I
GAN [29] can be used as a versatile framework for generating images through the process of image-to-image (I2I) translation, allowing the learning of mappings from a source domain to a target domain. The generative model in a GAN assumes a particular image distribution and learns to approximate it during training, enabling the generation of realistic images instead of solely classifying existing ones.
The main idea of GANs is to establish a zero-sum game between two networks (players), namely, a generator G and a discriminator D. Each network is represented by a differentiable function controlled by a set of parameters. The generator G learns to generate fake but plausible images, while the discriminator D learns to distinguish between the fake and real images. The solution of this game is a Nash equilibrium between the two networks.
The generator G takes as input a random noise vector z sampled from the model's prior distribution p(z) and generates an image G(z) to fit the distribution of real images. Then, the discriminator D takes a random real image x from the dataset and the synthetic image G(z) as inputs and outputs a probability between 0 and 1, indicating whether its input is a real or fake image. In other words, D wants to discriminate the synthetic image G(z) from the real image x, while G intends to generate synthetic images that confuse D. According to Goodfellow et al. [31], the objective of GAN can be expressed as
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))] \tag{1}$$
where $p_{data}(x)$ is the distribution of real images and p(z) is the prior distribution of the noise vector. By optimizing the objective function, the generator aims to generate data that the discriminator is more likely to classify as real, while the discriminator aims to correctly distinguish between the real and generated data. The training process involves iteratively updating the parameters of the generator and discriminator to improve their performance until equilibrium is reached.
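To make the two-player value function of Goodfellow et al. concrete, the following minimal sketch estimates V(D, G) from batches of discriminator outputs. The function name `gan_value` and the small constant `eps` (added only to avoid log(0)) are illustrative assumptions and are not part of the original work.

```python
import numpy as np

def gan_value(d_real, d_fake, eps=1e-12):
    """Monte Carlo estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator probabilities D(x) on real images.
    d_fake: discriminator probabilities D(G(z)) on generated images.
    eps is an assumed numerical guard against log(0).
    """
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return float(np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps)))

# A discriminator that cannot separate real from fake outputs 0.5 everywhere.
v_confused = gan_value([0.5, 0.5], [0.5, 0.5])
# A confident, correct discriminator drives the value towards its maximum of 0.
v_confident = gan_value([0.99, 0.98], [0.01, 0.02])
```

The discriminator is trained to increase this value and the generator to decrease it; at the equilibrium where D outputs 0.5 for every input, the value equals -2 log 2.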

Conditional GANs
In traditional GANs, there is no control over the content of the generated image because the only input to the network is the random noise vector z. To solve this issue, Mirza et al. [32] introduced a conditional version of GANs, where both the generator and discriminator are conditioned using additional information y. The conditional input y can encode various kinds of information, such as data labels, text and image attributes. In this research, we input the averaged OCT image as the conditional input to the generator and discriminator. According to Mirza et al. [32], the objective of cGAN can be expressed as
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x \mid y)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z \mid y)))] \tag{2}$$
In our study, the goal of the cGAN is to find an equilibrium where the generator produces a realistic denoised OCT image sample that is conditioned on the equivalent "averaged" OCT image, and the discriminator is unable to distinguish between real "averaged" OCT images and the generated denoised OCT images. The training process involves iteratively updating the parameters of the generator and the discriminator to improve their performance and reach equilibrium.
In the cGAN setting, an image restoration loss is added to the discriminator loss to measure the quality of the match between the input image and the generator's output image. Two of the commonly used losses are: (i) perceptual loss [33], which is the sum of the squared differences of features extracted from a pre-trained network such as the VGG [34] network, and (ii) mean-squared-error (MSE) loss. Perceptual loss has shown success in preserving structural content [21,23] in OCT images and is calculated using the VGG-19 network pretrained on ImageNet [35]:
$$\mathcal{L}_{perceptual} = \frac{1}{w\,h\,d} \lVert \phi(y) - \phi(x) \rVert_2^2 \tag{3}$$
where $\phi(\cdot)$ denotes the feature maps extracted from the pretrained network, and w, h, d represent the width, height, and depth of the convolutional layer, respectively. The MSE measures the average squared difference between the estimated values and the actual values, and the MSE loss is the average squared difference of pixels of two images y and x:
$$\mathcal{L}_{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - x_i)^2 \tag{4}$$
where n is the number of pixels in the image, y is the gold standard averaged OCT image and x is the output image. In our experiments, we computed the perceptual loss by extracting features from the block3 conv3 layer, aligning with the experimental approach of SiameseGAN to maintain methodological continuity.
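The difference between the pixel-wise and feature-space restoration losses described above can be illustrated with a short NumPy sketch. It is not the authors' implementation: `toy_features` is a hypothetical stand-in for the VGG-19 activations (it uses simple finite differences so that the example runs without pretrained weights), and the function names are illustrative.

```python
import numpy as np

def mse_loss(y, x):
    """Average squared pixel difference between target y and output x."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    return float(np.mean((y - x) ** 2))

def perceptual_loss(y, x, feature_fn):
    """Squared feature difference, normalized by the feature volume w*h*d."""
    fy = feature_fn(y)
    fx = feature_fn(x)
    return float(np.sum((fy - fx) ** 2) / fy.size)

def toy_features(img):
    """Hypothetical feature extractor (NOT VGG-19): horizontal and vertical
    finite differences stacked as two channels of shape (h-1, w-1, 2)."""
    img = np.asarray(img, dtype=float)
    gx = np.diff(img, axis=1)[:-1, :]
    gy = np.diff(img, axis=0)[:, :-1]
    return np.stack([gx, gy], axis=-1)

target = np.array([[0.0, 1.0], [2.0, 3.0]])
output = target + 1.0  # same structure, constant intensity offset

pixel_error = mse_loss(target, output)
feature_error = perceptual_loss(target, output, toy_features)
```

In this toy case the pixel-wise loss penalizes the intensity offset while the feature-space loss does not, because the stand-in features respond only to local structure. With a real pretrained network the behaviour is less clear-cut, but the example shows why the two losses can rank the same reconstruction differently.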

Methodology
In this section, we outline the structural details of the generators and discriminators utilized in designing the cGAN I2I networks for our study. We first describe the lightweight UNet and ResUNet models, which are used as generators, and then the PatchGAN and WGAN classifiers, which are used as discriminators.

Lightweight UNet model
The original UNet [27] was proposed by Ronneberger et al. for biomedical image segmentation.
The cGAN pix2pix adopted the UNet model to operate as its generator. This variant of UNet has a highly lightweight design compared to the original UNet architecture. It uses only one convolution in each encoder and decoder component, as opposed to two. Various studies have shown that this lightweight version produces comparable results to the original UNet model [36]. The lightweight UNet is beneficial for applications that have limited training data, which can occur in some medical imaging applications. The lightweight UNet architecture consists of seven encoder layers, one bottleneck layer, and seven decoder layers (Fig. 1). The encoder layers consist of a convolution, followed by batch normalization and Leaky ReLU (Conv+BN+LReLU) operations. The decoder layers consist of a transpose convolution, followed by batch normalization and ReLU (ConvT+BN+ReLU) operations. The bottleneck layer serves as a bridge between the encoder and decoder units and consists of a convolution followed by a ReLU (Conv+ReLU) operation. The first encoder layer and the last decoder layer do not contain batch normalization operations. For all convolutions, the stride is set to 2 and the kernel size is set to 4 × 4 (illustrated in Fig. 1).
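As a rough check on the encoder path described above, the following sketch traces the spatial size of the feature maps through the seven encoder layers and the bottleneck for a 512 × 512 input. It assumes "same"-style padding, so that each stride-2 convolution halves the spatial size, and that the bottleneck convolution also uses stride 2. The padding scheme is an assumption and is not stated in the text.

```python
def conv_out_size(size, stride=2):
    """Output size of a strided convolution with 'same' padding (assumed)."""
    return -(-size // stride)  # ceiling division

def encoder_sizes(input_size=512, n_down=8):
    """Spatial sizes after each of n_down stride-2 convolutions
    (seven encoder layers plus the bottleneck)."""
    sizes = [input_size]
    for _ in range(n_down):
        sizes.append(conv_out_size(sizes[-1]))
    return sizes

sizes = encoder_sizes()
encoder_output = sizes[7]     # after the seventh encoder layer
bottleneck_output = sizes[8]  # after the bottleneck convolution
```

Under these assumptions the 512 × 512 input is reduced to 4 × 4 after the seventh encoder layer and to 2 × 2 at the bottleneck.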

Deep residual UNet
The deep residual UNet (ResUNet), which was proposed by Zhang et al. [24], is a variant of UNet that uses residual connections to improve information flow and facilitate the training of the network. ResUNet consists of three parts: encoder, bottleneck, and decoder. All three parts are built with residual units (Fig. 2). Each residual unit consists of two 3 × 3 convolutional layers followed by batch normalization and ReLU activation layers. The identity mapping function contains a 1 × 1 convolution and a batch normalization operation. We utilize a 7-level architecture of deep ResUNet in our experiments. As shown in Fig. 2, the network has three residual units in the encoder section, one residual unit in the bottleneck, and three residual units in the decoder section. The encoder section encodes the input image into compact representations, with a stride of s = 2 in the convolutional layers to downsize the feature maps by half. The corresponding decoder path uses up-sampling of feature maps and a concatenation of feature maps from the corresponding encoder path before each residual unit. In the decoder path, a stride of s = 1 is used in the residual units. After the last residual unit in the decoder path, a 1 × 1 convolution followed by a Tanh activation layer is used to map the multi-channel feature maps back to the pixel-level detail of the reconstructed OCT image. The details of each layer, including the size of the feature maps at each layer, are illustrated in Fig. 2.

Wasserstein GAN
Arjovsky et al. introduced Wasserstein GAN (WGAN) [37] as a solution to the vanishing gradients problem [38], where gradients in neural networks become extremely small during training, leading to slow or ineffective learning. WGAN addresses this by estimating the Wasserstein distance, which measures the difference between the distributions of real and generated samples, to evaluate the authenticity of an image. The authors demonstrated that WGAN exhibits improved stability. The objective function of WGAN is formulated as:
$$\mathcal{L}_{WGAN}(D) = -\mathbb{E}_{x}[D(x)] + \mathbb{E}_{z}[D(G(z))] + \lambda\, \mathbb{E}_{\hat{x}}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right] \tag{5}$$
where the first two terms estimate the Wasserstein distance and the last term is a regularization term that provides the gradient penalty. λ is a constant weighting parameter, and $\hat{x}$ is uniformly sampled along straight lines that connect pairs of generated samples G(z) and real samples x. The schematic diagram of the WGAN is provided in Fig. 3.
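The structure of the WGAN objective with gradient penalty can be sketched in a few lines of NumPy. To keep the example self-contained without automatic differentiation, it uses a hypothetical linear critic D(x) = ⟨w, x⟩, whose gradient with respect to its input is simply w. The critic, the function names, and the default λ = 10 are illustrative assumptions rather than the configuration used in this study.

```python
import numpy as np

def critic(x, w):
    """Hypothetical linear critic D(x) = <w, x>; its input gradient is w."""
    return x @ w

def wgan_gp_critic_loss(real, fake, w, lam=10.0, rng=None):
    """Critic loss: E[D(fake)] - E[D(real)] + lam * gradient penalty.

    lam=10.0 is an assumed default weighting, not a value from the paper.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    # Sample points uniformly along straight lines between real and fake samples.
    eps = rng.uniform(size=(real.shape[0], 1))
    x_hat = eps * real + (1.0 - eps) * fake
    # For a linear critic the gradient at every x_hat equals w, so the
    # interpolation only fixes how many gradient samples are averaged.
    grads = np.broadcast_to(w, x_hat.shape)
    grad_norm = np.linalg.norm(grads, axis=1)
    penalty = np.mean((grad_norm - 1.0) ** 2)
    wasserstein = np.mean(critic(fake, w)) - np.mean(critic(real, w))
    return float(wasserstein + lam * penalty)

real = np.ones((4, 3))
fake = np.zeros((4, 3))
loss_unit_norm = wgan_gp_critic_loss(real, fake, np.array([1.0, 0.0, 0.0]))
loss_steep = wgan_gp_critic_loss(real, fake, np.array([2.0, 0.0, 0.0]))
```

With a unit-norm gradient the penalty vanishes and only the Wasserstein estimate remains. A steeper critic separates real from fake more strongly but is penalized for violating the unit-gradient constraint. In a practical network the gradient at each interpolated point would be obtained by automatic differentiation.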

PatchGAN
The PatchGAN [28] penalizes structure at the scale of patches to ensure high-frequency correctness. It tries to classify each N × N patch as real or fake. The final output is the result of averaging all convolutional responses run over the image patches. The PatchGAN input is real and fake image pairs. It has five convolutional layers. After the last layer, a convolution is applied to map the patch responses to a one-dimensional output, followed by a Sigmoid operation. The design of PatchGAN depends on the receptive field, i.e., the size of the patch p. The receptive field denotes the relationship between one output activation of the model and an area of the input image. In this experiment, a size of p = 70 × 70 was employed, as it was determined to be the optimal choice by Isola et al. [28]. The schematic diagram of PatchGAN is illustrated in Fig. 3. The PatchGAN discriminates on image patches, whereas WGAN discriminates on the whole image as either "real" or "fake". The PatchGAN input is noisy and averaged OCT image pairs, whereas the WGAN input is only the noisy OCT image. Both discriminators have similar architecture, with the same layers. However, there are two main differences that make each network unique. The first is the input layer, which is responsible for taking in the data. The second is the final layer, which is responsible for producing the output. These small design choices have a significant impact on the two networks' functionality and purpose.
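The objective function of PatchGAN is formulated as:
$$\mathcal{L}_{PatchGAN}(G, D) = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x, z)))] + \lambda\, \mathcal{L}_{MSE}(G) \tag{6}$$
where the first two terms are conditioned by x, compared to the unconditional variant in which the discriminator does not observe x (Eq. (5)), and the last term is the MSE restoration loss weighted by a constant λ.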
Figure 3 illustrates a comparison of PatchGAN with the WGAN discriminator. The two networks have a similar architecture of Conv+BN+LReLU layers. However, the two networks differ firstly in that the input of the PatchGAN is "conditioned" [28] with the "averaged" OCT images. This makes PatchGAN a conditional discriminator, whereas the WGAN is a discriminator not conditioned on the target images. Secondly, the final real/fake output of PatchGAN is a majority of votes on the realness/fakeness of input image patches, whereas the WGAN final real/fake output is the classification of the whole image.
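The relationship between the discriminator's layer configuration and its 70 × 70 patch size can be verified with the standard receptive-field recursion. The sketch below assumes the layer stack commonly used for the pix2pix 70 × 70 discriminator (4 × 4 kernels, three stride-2 convolutions followed by two stride-1 convolutions). This configuration is an assumption for illustration, and the exact stack used in this study may differ.

```python
def receptive_field(layers):
    """Receptive field of one output unit for a stack of convolutions.

    layers: list of (kernel_size, stride) tuples ordered from input to output.
    Walks backwards from a single output activation to the input image.
    """
    rf = 1
    for kernel, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf

# Assumed layer stack: 4x4 kernels, strides 2, 2, 2, 1, 1.
PATCHGAN_70 = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]
patch_size = receptive_field(PATCHGAN_70)
```

Each output activation of such a stack therefore "sees" a 70 × 70 region of the input, which is what allows the discriminator to judge texture locally rather than classify the whole image at once.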

Experimental dataset
Our dataset comprises foveal-centered OCT retinal scans of 226 children aged between 4 and 12 years with normal vision in both eyes and no history of ocular pathology [39]. To acquire the images, a spectral domain OCT instrument (Copernicus SOCT-HR Optopol Technology SA, Zawiercie, Poland) was used. The dataset is ideal for the experiments because it consists of multiple noisy OCT B-scans captured at the same retinal location as well as the corresponding averaged "noise-free" image pairs (Fig. 4). The original image size is 999 × 868 pixels. The averaged image is obtained by image registration and averaging, as presented by Alonso-Caneiro et al. in [40]. All the OCT images were resized to 512 × 512 pixels to suit the network input. The inputs to all networks were identical, consisting of 1460 noisy and averaged OCT image pairs. The images were fed to the networks as whole images in gray-scale format.

Experimental settings
We built six networks by varying the selection of generators (UNet and ResUNet), discriminators (PatchGAN and WGAN), and loss functions (Perceptual and MSE for WGAN, and MSE for PatchGAN, as the latter can only be trained with MSE loss). Additionally, for comparative analysis, we assessed the performance of the SiameseGAN network on our OCT image dataset. The authors provided the code implementation, which was accessible at https://github.com/sml-iisc/SiameseGAN. Table 1 lists the seven networks and their constituent components.
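The combination of generators, discriminators, and losses described above can be enumerated programmatically. The short sketch below builds the seven network labels using the naming pattern that appears in the text (for example UNet-PatchGAN-MSE). The variable and function names are illustrative assumptions.

```python
from itertools import product

GENERATORS = ["UNet", "ResUNet"]
# PatchGAN is paired only with MSE; WGAN is paired with MSE and Perceptual.
DISCRIMINATOR_LOSSES = [("PatchGAN", "MSE"), ("WGAN", "MSE"), ("WGAN", "Perceptual")]

def build_network_names():
    """Return the six ablation networks plus the SiameseGAN baseline."""
    names = [
        f"{generator}-{discriminator}-{loss}"
        for generator, (discriminator, loss) in product(GENERATORS, DISCRIMINATOR_LOSSES)
    ]
    names.append("SiameseGAN")
    return names

network_names = build_network_names()
```

Two generators combined with three discriminator and loss pairings give the six ablation networks, and the externally provided SiameseGAN brings the total to seven.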

Super-computing infrastructure
The deep learning frameworks were implemented in TensorFlow, and all experiments were conducted on one GPU node of the Bracewell CSIRO super-computing facility, where each node consists of 2x Intel Xeon Broadwell E5-2680 v4 14-core CPUs (28 cores total) @ 2.4 GHz (nominal), 256 GB of RAM, and 4x NVIDIA Tesla P100s, each card having 16 GB of memory. We trained each network for 100 epochs. At every epoch, the trained network was tested with a set of unseen validation images, the network weights were saved, and the PSNR of the validation images was calculated.

Evaluation metrics
We evaluated the reconstructed OCT images using both quantitative and qualitative measures. Following a similar methodology to our previous work [21], the reconstructed OCT images were evaluated with well-known metrics such as peak signal-to-noise ratio (PSNR), structural similarity (SSIM), texture preservation (TP), and edge preservation (EP) indices. In addition, two no-reference image sharpness metrics, just noticeable blur (JNB) [41] and spectral and spatial sharpness (S3) [42], are reported in our experiments.
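Of the metrics listed above, PSNR has the simplest closed form, 10·log10(MAX² / MSE), and can be sketched as follows. This is the generic whole-image definition. The dynamic range `max_value=255.0` is an assumption for 8-bit gray-scale images, and region-specific variants such as a background-only PSNR would additionally require a mask that is not defined here.

```python
import numpy as np

def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between a reference and a test image.

    max_value is the assumed dynamic range (255 for 8-bit images).
    """
    reference = np.asarray(reference, dtype=float)
    test = np.asarray(test, dtype=float)
    mse = np.mean((reference - test) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

averaged = np.zeros((4, 4))
reconstructed = np.full((4, 4), 25.5)  # uniform error of 10% of the range
example_psnr = psnr(averaged, reconstructed)
```

In this toy example a uniform error of one tenth of the dynamic range corresponds to a PSNR of 20 dB, and each tenfold reduction in root-mean-square error adds a further 20 dB.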

Expert feedback qualitative analysis
To qualitatively assess the performance of each network in reconstructing OCT images, six ophthalmologists who were not involved in the development of the networks were asked to provide feedback based on certain criteria. These expert clinicians all have at least 2 years of experience in retinal image analysis and acquisition.
Three evaluation tests were specifically designed to gather expert user feedback and assess the performance of the networks. These tests encompassed texture similarity, structural integrity, and perceptual reality, enabling a comprehensive evaluation and comparison of the networks. The primary objective of these tests was to gauge the realism of the reconstructed OCT images as perceived by human experts, in comparison to the averaged OCT image. Furthermore, the preservation of crucial image content, such as blood vessels, macula boundary, and retinal layers, was also assessed. Thus, the three evaluation tests were designed with these objectives in mind:
• Test 1 - Texture similarity: With reference to the averaged OCT image (labelled for the observer), choose the top three reconstructed OCT images that exhibit the most similar texture to the averaged OCT image (gold standard).
• Test 2 - Structural integrity: With reference to the original noisy image (labelled for the observer), pick the top three reconstructed images that best preserve the content (detail) of the noisy image.
• Test 3 - Perceptual reality: Label the reconstructed OCT images as "Synthetically generated" or "Perceptually realistic" to evaluate the "fakeness" or "realness" of the reconstructed images.
Thirty randomly selected sets out of 200 sets of 9 images (the averaged, the noisy and the corresponding reconstructed images from the 7 trained GAN networks) were used for this experiment. A web-based dashboard was designed to receive the expert users' feedback. The system displayed the thirty records to the observer, who was blinded to the image condition, except where a reference image was used. The user could open each record to view the 9 OCT images. In Test 1, the averaged image was labelled as the reference for the user. The same set of images was presented to all image graders and used for all three tests.

Quantitative results
Table 2 presents the averaged quantitative results obtained from our experiments. The table showcases various evaluation metrics, including PSNR-B (peak signal-to-noise ratio of the background), PSNR-All (peak signal-to-noise ratio of the whole image), SSIM-B (structural similarity index of the background), SSIM-All (structural similarity index of the whole image), EPI (edge preservation), ENL (equivalent number of looks), TP (texture preservation), and JNB (just noticeable blur). The findings emphasize the effectiveness of using the PatchGAN discriminator in conjunction with the ResUNet and UNet models, especially when trained with MSE loss, for achieving favourable quantitative results. Note that the values marked in bold represent the highest scores in their respective categories.
Additionally, we analyze the Edge Preservation Index (EPI) along the vertical direction at each boundary of the retinal layers. Figure 5 provides a comparative visualization of EPIs across the seven networks. Notably, SiameseGAN exhibits the highest EPI values near the ISOS and RPE layers, indicating superior edge preservation in those regions. On the other hand, ResUNet-WGAN-Perceptual demonstrates the highest EPI scores around the IPL, OPL, and ELM layers.
Furthermore, we evaluate the perceived sharpness of all seven networks using the S3 measure, as depicted in Fig. 6. The S3 (spectral and spatial sharpness) method produces a perceptual sharpness map in which higher values indicate regions perceived as sharper by the human visual system. The overall sharpness is estimated by identifying the sharpest region in the image, corresponding to the maximum value in the sharpness map. The resulting S3 value offers a quantitative assessment of the overall perceived sharpness of the entire image.
The diagram clearly illustrates that SiameseGAN achieves the highest S3 score, surpassing UNet-WGAN-Perceptual by more than 0.1 in perceived sharpness. Both UNet-PatchGAN and ResUNet-PatchGAN achieve S3 scores above 0.5, while the averaged OCT image achieves a perceived sharpness score above 0.3.
[Figure caption: In each record, the original noisy image and the corresponding averaged image were labelled for the user. The user would then pick either a "real" or a "fake" label for the images reconstructed by the seven trained networks.]
[...] that the averaged OCT image was not selected 20% of the time among the top three choices. According to Fig. 8, UNet-PatchGAN-MSE received the highest votes, competing with the averaged OCT image. SiameseGAN was mostly picked as second and third choice.

Discussion
We have investigated the effect of the texture loss of the PatchGAN discriminator on the denoising performance of GAN networks for OCT speckle noise reduction. We performed an ablation study on generators (UNet and ResUNet) and discriminators (PatchGAN and WGAN). Additionally, different loss functions were considered, given that the PatchGAN discriminator loss typically uses MSE, while the WGAN discriminator has been used with both MSE and perceptual losses. To understand the influence of each module in reconstructing OCT images, an experiment including six networks was performed in this study. The study also compares these networks with SiameseGAN (the seventh network), which consists of a ResUNet generator and a WGAN discriminator with perceptual loss, and is coupled with a twin Siamese network as an additional adversarial loss. We used a range of quantitative measures (PSNR-B, PSNR-All, SSIM-B, SSIM-All, EPI, ENL, TP, JNB, and S3) to evaluate the reconstructed OCT images. In addition, we gathered feedback from six clinical experts through three scenarios in which they had to categorize the reconstructed OCT images based on style (Test 1), content (Test 2) and perceptual realness (Test 3).
Our findings using real clinical images indicated that the texture loss of PatchGAN can effectively optimize the UNet generator with MSE adversarial loss to generate reconstructed OCT images that were superior to SiameseGAN for a number of metrics, including PSNR-B (32.50 dB compared to 31.02 dB), PSNR-All (44.48 dB compared to 33.76 dB), SSIM-B (0.88 compared to 0.85), and SSIM-All (0.97 compared to 0.89), while also scoring high on EPI (0.90 compared to 0.77). These quantitative findings also agree with the qualitative assessment. According to the imaging experts' feedback, the UNet-PatchGAN-MSE network was chosen among the top three choices 100% of the time, whereas SiameseGAN was chosen among the top three only 63% of the time. According to the users' feedback, SiameseGAN is prone to introducing artifacts in the reconstructed OCT images that compromise the structural integrity of the retina. Similarly, PatchGAN can effectively optimize the ResUNet generator with MSE adversarial loss to generate reconstructed OCT images that scored higher than SiameseGAN in PSNR-B (32.88 dB compared to 31.02 dB), PSNR-All (44.13 dB compared to 33.76 dB), SSIM-B (0.91 compared to 0.85), SSIM-All (0.98 compared to 0.89), and EPI (0.93 compared to 0.77). However, ResUNet-PatchGAN-MSE was not a popular pick in comparison with the SiameseGAN and UNet-PatchGAN-MSE networks. Clinicians expressed dissatisfaction with the images due to excessively blurred textures, which resulted in a perceived fading of retinal structures and diminished perceptual clarity.
Our findings underscore the critical role of visual evaluation in guiding the selection of image reconstruction methodologies. The unanimous preference for the UNet-PatchGAN-MSE network by imaging experts, coupled with their reservations regarding artifacts introduced by SiameseGAN, highlights the necessity of incorporating qualitative feedback (visual assessment) in the assessment of proposed methods. As we look towards future research, the integration of more extensive clinical datasets and the active involvement of domain experts in the evaluation process will be paramount. Furthermore, while PatchGAN proves effective in optimizing both UNet and ResUNet generators, the relatively lower popularity of ResUNet-PatchGAN-MSE among imaging experts raises questions about the interplay between network architectures and visual preferences. This observation emphasizes the need for further exploration of the interaction of different generator-discriminator architectures and their impact on visual acceptability. For future research, a holistic approach integrating quantitative metrics, qualitative assessments, and, most importantly, clinical expert feedback will be crucial. Such a comprehensive evaluation strategy ensures that the developed methodologies not only excel on numerical benchmarks but also align with the nuanced requirements and preferences of practitioners. As we refine image reconstruction techniques further, bridging the gap between quantitative benchmarks and clinical applicability will remain a central theme in our research agenda.
UNet and ResUNet generators coupled with a WGAN discriminator optimised with MSE exhibited similar quantitative measures in PSNR-B (32.60 dB and 32.96 dB), PSNR-All (44.66 dB and 44.60 dB), SSIM-B (0.89 and 0.90), SSIM-All (0.97 both), EPI (0.91 both), ENL (1.15 both), TP (0.97 both), and JNB (4.90 and 4.70). However, according to the imaging experts' feedback, the UNet-WGAN-MSE network was picked among the top three images of choice 22% of the time, whereas ResUNet-WGAN-MSE was never picked among the top three on texture similarity to the averaged OCT image in Test 1. UNet-WGAN-MSE (21%) scored almost 10 percentage points higher than ResUNet-WGAN-MSE (12%) on structural integrity in Test 2. UNet-WGAN-MSE received 48% of votes in terms of realism, 21 percentage points more than ResUNet-WGAN-MSE with 27% of votes. Further analysis of the experts' feedback revealed a consistent observation: both networks exhibited a certain degree of blurriness in the reconstructed OCT images. However, the ResUNet-WGAN-MSE network demonstrated a more pronounced blurry effect, suggesting limitations in capturing finer retinal structural and layer details compared to its UNet counterpart. The UNet-WGAN-MSE network, on the other hand, exhibited a higher capacity to reconstruct images with greater fidelity to retinal structures. While quantitative measures provide valuable feedback, the distinct preference for UNet-WGAN-MSE highlights that the perceived quality of reconstructed images extends beyond numerical benchmarks. The ability to capture fine details and nuances in retinal structures, as noted by the experts, underscores the importance of qualitative assessments in refining image reconstruction methodologies for clinical applications.
It is worth noting that the dataset used in this study only included OCT images of healthy individuals. Future work should explore the potential of the proposed method on OCT datasets with images that present pathologies. Similarly, the method was only tested on data from a single OCT instrument. The method's extension to other OCT devices should also be explored. It is likely that the proposed network could serve as a pre-trained network that can be fine-tuned for other OCT devices.

Conclusion
In conclusion, this study has presented a robust methodology for speckle noise reduction in OCT images through image reconstruction, leveraging the capabilities of conditional Generative Adversarial Networks (cGAN). The architecture involved a UNet generator with skip connections, coupled with a PatchGAN discriminator. A comprehensive exploration of the discriminator and generator architectures, along with the impact of different training loss functions, was conducted. The dataset, comprising 1660 B-scan OCT images, enabled a thorough investigation, with 1460 image pairs dedicated to training and the remaining 200 pairs equally divided between validation and testing.
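The split sizes stated above can be reproduced with a short sketch such as the one below. Only the counts (1660 pairs in total, 1460 for training, 100 each for validation and testing) come from the text; the random shuffling, the seed, and the function name are assumptions made for illustration, and the study may have partitioned the data differently, for example by subject.

```python
import random


def split_pairs(n_pairs=1660, n_train=1460, n_val=100, n_test=100, seed=0):
    """Return index lists for train, validation, and test image pairs.

    Shuffling with a fixed seed is an assumption; the source only gives the counts.
    """
    if n_train + n_val + n_test != n_pairs:
        raise ValueError("split sizes must sum to the number of image pairs")
    indices = list(range(n_pairs))
    random.Random(seed).shuffle(indices)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


if __name__ == "__main__":
    train_idx, val_idx, test_idx = split_pairs()
    print(len(train_idx), len(val_idx), len(test_idx))
```

When several B-scans come from the same eye, grouping the indices by subject before splitting would be a reasonable precaution against leakage between the training and test sets.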
The experimental outcomes revealed notable improvements in key metrics, including PSNR, TP, SSIM, and EPI, when compared to SiameseGAN, a state-of-the-art method for speckle reduction. These quantitative advancements were typically supported by qualitative assessments, as indicated by the favorable feedback from clinical experts. Notably, the UNet-PatchGAN-MSE configuration emerged as the preferred choice among imaging experts, demonstrating its efficacy in achieving enhanced image fidelity and noise reduction.
However, the nuanced preferences and reservations expressed by clinicians, particularly regarding the blurriness associated with ResUNet-WGAN-MSE, emphasize the importance of a holistic approach when assessing the outcomes produced by these networks. While quantitative benchmarks provide valuable insights, the incorporation of expert feedback has proven indispensable in better understanding the potential of image reconstruction methodologies for real-world clinical applications. The unanimous preference for the UNet-PatchGAN-MSE network, coupled with the acknowledgment of artifacts introduced by alternative methods like SiameseGAN, underscores the necessity of incorporating visual feedback in the development process and subsequent assessment.
As we chart a course for future research, it becomes evident that the interplay between different generator-discriminator architectures and their impact on clinical acceptability demands further exploration. The comprehensive evaluation strategy, integrating quantitative metrics, qualitative assessments, and expert feedback, will remain pivotal. The imperative of bridging the gap between quantitative benchmarks and clinical applicability will continue to guide our research agenda. Future endeavours should focus on the integration of more extensive clinical datasets and the active participation of domain experts in the evaluation process, ensuring that the developed methodologies not only excel in numerical benchmarks but also align with the nuanced requirements and preferences of medical practitioners.
In summary, the presented methodology stands as a promising advancement in OCT image reconstruction, demonstrating its potential for clinical utility. Additionally, the combination of quantitative evaluations and clinical feedback serves as a paradigm for future research in medical image processing, where the success of a methodology lies in its ability to meet the multifaceted demands of real-world clinical scenarios.

Fig. 1. Lightweight UNet used in pix2pix. The architecture contains 54 million trainable parameters. "f" represents the number of filters, "k" the kernel size, and "s" the stride of each layer. E1 stands for encoder type 1, and E2 stands for encoder type 2. Similarly, D1 stands for decoder type 1 and D2 stands for decoder type 2. E1, E2, D1, D2, and the bottleneck, displayed on the right side, are the building blocks of the lightweight UNet architecture. The diagram displays the size of the input and output feature maps for each component.

Fig. 2. ResUNet and its components. This architecture contains 19 million trainable parameters. "f" represents the number of filters, "k" the kernel size, and "s" the stride of each layer. The residual block is the building component of the ResUNet architecture; its details are displayed on the right-hand side of the diagram. The diagram displays the size of the input and output feature maps for each component.

Fig. 3. Comparison of the PatchGAN and WGAN concepts, architectures, and network modules. The PatchGAN discriminates on image patches, whereas the WGAN discriminates on the whole image as either "real" or "fake". The PatchGAN input is a pair of noisy and averaged OCT images, whereas the WGAN input is only the noisy OCT image. Both discriminators have a similar architecture, with the same layers. However, two main differences make each network unique. The first is the input layer, which is responsible for taking in the data. The second is the final layer, which is responsible for producing the output. These small design choices have a significant impact on the two networks' functionality and purpose.
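The patch-level behaviour described in this caption can be illustrated by computing the output grid size and the receptive field of a convolutional discriminator. The sketch below is not the paper's network: it assumes the default 70×70 PatchGAN layer stack from pix2pix (kernel 4, strides 2, 2, 2, 1, 1, padding 1) and a hypothetical 256×256 input.

```python
# Assumed layer stack: the default 70x70 PatchGAN from pix2pix, given as
# (kernel, stride) pairs with padding 1 on every layer. The discriminator
# used in this paper may use different settings.
PATCHGAN_LAYERS = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]
PADDING = 1


def conv_output_size(size, kernel, stride, padding):
    """Spatial size after one convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def output_map_size(input_size, layers, padding=PADDING):
    """Side length of the discriminator's output grid for a square input."""
    size = input_size
    for kernel, stride in layers:
        size = conv_output_size(size, kernel, stride, padding)
    return size


def receptive_field(layers):
    """Input patch side length seen by one output unit, computed back to front."""
    rf = 1
    for kernel, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf


if __name__ == "__main__":
    grid = output_map_size(256, PATCHGAN_LAYERS)  # 256 is a hypothetical input size
    patch = receptive_field(PATCHGAN_LAYERS)
    print(f"{grid}x{grid} decisions, each covering a {patch}x{patch} patch")
```

Under these assumptions a 256×256 input yields a 30×30 grid of real/fake decisions, each covering a 70×70 pixel patch, whereas a whole-image discriminator reduces its final layer to a single score per image.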

Fig. 4. Original OCT B-scan image (A) and the corresponding averaged OCT B-scan (B).

Fig. 5. (a) an averaged B-scan OCT image with marked retinal layer boundaries, and (b) edge preserving index (EPI) for test OCT images around retinal layer boundaries.

Fig. 6. The slope parameter α of each trained GAN network and their relative positions on the perceived sharpness measure spectrum.

Fig. 7. Left: violin plot illustrating clinical observer classification of the top three images that exhibit the most similar texture to the averaged OCT image. The white circle indicates the median of the votes, and the thick black bar shows the interquartile range. Right: screenshot of the image grading tool displaying one typical record containing the noisy OCT image, its corresponding averaged image, and the images reconstructed by the seven networks.

Fig. 8. Left: violin plot illustrating clinical observer classification of the top three images that exhibit the most structural integrity compared to the original noisy image. The white circle indicates the median of the votes, and the thick black bar shows the interquartile range. Right: screenshot of the image grading tool displaying one typical record containing the noisy OCT image, its corresponding averaged image (labelled for the user), and the associated OCT images reconstructed by the seven networks.

Fig. 9. Each column shows the same cross-sectional example, displaying the averaged OCT image and the OCT images reconstructed by the SiameseGAN network and the UNet-PatchGAN-MSE network. On the left side, the yellow arrow highlights the ambiguity in the ILM layer due to a failure in the image registration process. This figure shows that the networks were successful in reconstructing the OCT image, regardless of the averaged image. On the right, the yellow arrow highlights the artifact introduced by the SiameseGAN network.

Fig. 10. Right: violin plot illustrating clinical observer classification of the perceptual realism of the reconstructed images. The white circle indicates the median of the votes, and the thick black bar shows the interquartile range. Left: screenshot of the image grading tool displaying one typical record. In each record, the original noisy image and the corresponding averaged image were labelled for the user. The user would then pick either a "real" or "fake" label for each of the images reconstructed by the seven trained networks.

Fig. 11. The left column displays the averaged OCT image and the OCT images reconstructed by the ResUNet-WGAN-MSE network and the UNet-PatchGAN-MSE network. The yellow arrow shows the artifact introduced in the reconstructed image by the ResUNet-WGAN-MSE network. The right column displays the averaged OCT image and the OCT images reconstructed by the ResUNet-WGAN-Perceptual and UNet-PatchGAN-MSE networks. The yellow arrow highlights the artifact introduced by the ResUNet-WGAN-Perceptual network.
The $\mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)]$ term represents the expectation over the real data distribution $p_{\text{data}}(x)$.
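For context, this term is the first half of the standard GAN value function. Assuming the adversarial objective here follows that standard form, the full expression can be written as below; the generator $G$, its input $z$, and the input distribution $p_z(z)$ are symbols from the standard formulation and are not defined in this excerpt. A conditional GAN would additionally condition $D$ and $G$ on the noisy input image.

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]
```

The first expectation rewards the discriminator for assigning high probability to real samples, and the second rewards it for assigning low probability to generated ones.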