Unpaired intra-operative OCT (iOCT) video super-resolution with contrastive learning

Regenerative therapies show promise in reversing sight loss caused by degenerative eye diseases. Their precise subretinal delivery can be facilitated by robotic systems alongside with Intra-operative Optical Coherence Tomography (iOCT). However, iOCT’s real-time retinal layer information is compromised by inferior image quality. To address this limitation, we introduce an unpaired video super-resolution methodology for iOCT quality enhancement. A recurrent network is proposed to leverage temporal information from iOCT sequences, and spatial information from pre-operatively acquired OCT images. Additionally, a patchwise contrastive loss enables unpaired super-resolution. Extensive quantitative analysis demonstrates that our approach outperforms existing state-of-the-art iOCT super-resolution models. Furthermore, ablation studies showcase the importance of temporal aggregation and contrastive loss in elevating iOCT quality. A qualitative study involving expert clinicians also confirms this improvement. The comprehensive evaluation demonstrates our method’s potential to enhance the iOCT image quality, thereby facilitating successful guidance for regenerative therapies.


Introduction
Regenerative therapies [1,2] are considered promising treatments for degenerative eye diseases such as Age-Related Macular Degeneration (AMD) [3].AMD is a leading cause of visual impairment for elderly individuals in developed countries [4].The effectiveness of these therapies heavily relies on the precise delivery of novel therapeutic agents, either sub-retinally or intraretinally.Consequently, robotic systems accompanied by exceptional visualization capabilities may provide the level of precision required during the implantation [5].Intra-operative Optical Coherence Tomography (iOCT) can significantly assist such vitreoretinal surgeries by capturing cross-sectional images of the retina to guide therapy delivery.
In the pre-operative setting, Optical Coherence Tomography (OCT) is widely used to noninvasively visualize distinctive retina layers offering valuable diagnostic capabilities.OCT systems use light in the near-infrared spectral range to capture cross sectional images and subsequently apply spatiotemporal signal averaging to provide OCT images of high quality.This enables clinicians to effectively differentiate between various retinal layers and pathologies.However, the long acquisition time of pre-operative OCT scans renders them unsuitable for visualization of real-time interventions.On the other hand, iOCT provides the advantage of real-time acquisition of cross sectional retinal images in 2D and 3D with micrometer resolution.Studies indicate that iOCT can assist the precise delivery of the therapeutics in the subretinal space [6].Nonetheless, this real-time acquisition is provided at the expense of compromised image quality as evidenced by high levels of speckle noise [7] and low signal strength [8], thereby limiting its interventional utility.The purpose of our work is to augment the capabilities of the current iOCT systems by computationally enhancing the iOCT image quality without requiring costly hardware upgrade.
Several approaches have been proposed for OCT image quality enhancement including diffusion-based [9], registration-based [10] and segmentation-based [11] methods along with Wiener filters [8], non-local filters [12] and sparse coding [13].These methods can successfully enhance the OCT image quality by augmenting retinal layer spatial details and mitigating speckle noise, while preserving the image content.However, they rely solely on information from a single iOCT image without employing a learning-based mapping to a high-quality domain.This strategy might not be sufficient for simultaneous noise reduction and generation of high-frequency information and finer details which are vital for enhancing the overall quality of the denoised images.Furthermore, the computational cost and the long scan acquisition time prohibits the use of similar approaches for real-time iOCT quality enhancement."Super-resolution" and "quality enhancement" are used interchangeably as usual in the literature.
Generative models such as Generative Adversarial Networks (GANs) [14] have been successfully used in image quality enhancement or image domain translation tasks using natural image datasets [15][16][17].Moreover, these models have been applied to various medical image modalities such as PET [18], CT [19], OCTA [20,21], OCT [22][23][24] and iOCT [25][26][27].However, many of these methods [20,21,24] artificially generate paired LR and HR datasets by simulating the degradation process with conventional filters.It is important to acknowledge that the superresolution challenge we address in our problem involves a more substantial gap between LR and HR domains.This is because iOCT images are acquired in real-time during surgical procedures, which can lead to dynamic changes in tissue conditions, lighting variations, motion artifacts and high noise levels.These factors can introduce substantial variations in image appearance and quality creating a domain that can not be easily simulated by conventional filters.Also, iOCT images exhibit surgical instruments interactions, a characteristic absent in pre-operative OCT diagnostic images.Furthermore, the existing methods focus on single-image quality enhancement disregarding temporal consistency and thereby leading to inhomogeneous enhancement when applied in videos.While numerous works have been proposed for video quality enhancement [28], they necessitate well-aligned input and target videos, a condition which often remains unmet in medical imaging domain, particularly in iOCT.
This research explores unpaired iOCT video super-resolution with the objective of improving image quality of low-resolution (LR) iOCT video frames by leveraging information from highresolution (HR) pre-operatively acquired OCT images.We propose an adversarial framework which contains a bidirectional recurrent neural network (VSR model) as generator to ensure temporal consistency, trained with a multilayer, patchwise contrastive loss to enable superresolution between unpaired LR and HR datasets.Our VSR model adopts critical components from state-of-the-art VSR models such as BasicVSR [29] for efficient alignment and fusion of temporal information.The vast majority of the proposed VSR models (including BasicVSR) are trained on artificially paired video datasets using pixel-level supervision.As simultaneous acquisition of real LH and HR iOCT videos is not possible, we ease the requirement of pixel-level supervision by using contrastive loss.Patchwise Contrastive Learning [30] enables unpaired super-resolution by preserving the content using maximization of the mutual information between corresponding LR and generated HR patches.To establish the effectiveness of the proposed framework we provide extensive quantitative analysis showcasing our model's superior performance against current state-of-the art iOCT super-resolution techniques.Additionally, to further support our design choices, we conducted ablation studies, revealing that the aggregation of temporal information and the use of multilayer patchwise contrastive loss play crucial roles in image quality enhancement and structure preservation, respectively.Therefore, our contributions can be summarized as: First, we have successfully achieved a new state-of-the-art performance in iOCT super-resolution, thereby augmenting the capabilities of this imaging modality.Second, we propose a novel methodology for training VSR models in unpaired setting enabling the application of these models in domains where paired datasets are limited, such as the medical imaging domain.Also, to facilitate the super-resolution research in Optical Coherence Tomography and to provide the broader research community with a valuable resource for experimentation and further advancements, our source code will be available online at: https://github.com/RViMLab/BOE2023-iOCT-video-super-resolution.

Methods
In this section, we present the clinical data utilized in our study and we introduce our proposed unpaired video super-resolution framework.We outline its design and highlight its key features, providing implementation details for a comprehensive understanding of the methodology.

Datasets
We used an internal database which contains intra-operative videos and pre-operative OCT data accompanied with vitreoretinal surgery videos (see Fig. 1).The dataset encompasses a diverse cohort of 61 patients presenting various pathologies including: Macular hole, Epiretinal membrane, Optic disc pit maculopathy, Floaters, Choroideremia, Myopic foveoschisis, Vitreomacular traction and Synchisis Scintillans.The data acquisition process adhered to the principles outlined in the Declaration of Helsinki (1983 Revision).The intra-operative OCT (iOCT) video sequences used in our research consist of 5 frames per sequence, with a resolution of 304×448 pixels.Each iOCT frame has a variable width of 3-16 mm and height of 2mm.These sequences were acquired during the vitreoretinal surgeries using RESCAN 700 mounted on the Zeiss OPMI LUMERA 700.The patients' pre-operative OCT (preOCT) scans (resolution of 512×1024×128 voxels), acquired using a Cirrus 5000, undergo a process of extraction, leading to the creation of 128 two-dimensional (2D) images.Each 2D image has width of 6mm and height of 2mm.The extracted 2D images subsequently serve as constituents of the high-resolution (HR) domain.The dataset, 9676 iOCT video sequences and 7808 preOCT images, was split randomly into: training set (43 patients, 70%), test set (9 patients, 15%) and validation set (9 patients, 15%).The data of each patient was used only in one set.The various pathologies present in the dataset are representatively distributed between both the training and test sets, thus maintaining a satisfactory level of pathological diversity across the partitions.Please note that the data can be provided to interested researchers via a data-sharing agreement, while there are parallel endeavours to make it public.

Unpaired video super-resolution
The task addressed in this work is unpaired video super-resolution of iOCT.Given a LR iOCT video sequence x ∈ X ⊂ R T×H×W×C , with T being its temporal extent, H and W being height and width respectively, C denoting its channels (C = 1) and each frame denoted by x t (of shape H × W × 1), the goals of our video super-resolution approach are three-fold: First, ∀x t we aim to generate a high resolution image ˆ︁ y t with the appearance and image quality of the target domain Y ⊂ R H×W×1 .Second, temporal information from previous input frames {ˆ︁ x t−N , . . .,ˆ︁ x t−1 }, within a certain time window of size N<T, must be efficiently incorporated to generate image ˆ︁ y t of enhanced quality.Third, ˆ︁ y t must retain the structural information contained in the original LR iOCT input x t .We now discuss how each of those objectives is addressed by the proposed approach.

Appearance mapping using adversarial training
To enforce that the appearance and image quality of the HR domain Y is transferred to a LR input video frame x t , we resort to adversarial training, given its established effectiveness for domain mapping problems [16].Our model consists of a generator G, shown in Fig. 2, and a PatchGAN discriminator D [17] for which the following standard adversarial objective is optimized: By minimizing this loss, G learns to generate video frames ˆ︁ y t that are indistinguishable by D from real images y ∈ Y.It is also noted that the generator in the above equation receives as input sequences of frames x ∈ X and the discriminator operates over single frames.This choice is discussed next.

Incorporating temporal information in the generator
To incorporate temporal information we condition the generation of ˆ︁ y t on a time window of the input video sequence x.More specifically, given a sequence of N consecutive LR iOCT frames x [t−N:t] , we obtain the corresponding super-resolved frame as ˆ︁ y t = G(x [t−N:t] ).In this work, G (shown in Fig. 2) is a typical bidirectional recurrent network that adopts essential components from the foundational work of BasicVSR [29].Recurrent VSR networks, as BasicVSR, have showcased excellent performance in natural video super-resolution by spatially increasing the input's pixel resolution (by a factor of: 2 or 4), thus we integrate several key modules from bidirectional propagation, flow estimation and warping modules within our proposed framework, to incorporate the temporal information.Importantly, while [29] proposes these architectural designs when pixel-level supervision is available, we explore their applicability and importance in the more challenging task where no such paired data are available and the former also includes a domain mapping component (Sec.2.2).
First, the input sequence x [t−N:t] is downsampled and subsequently passed through the backward propagation branch.Backward propagation includes optical flow estimation, spatial warping and reconstruction modules.Particularly, as shown in Fig. 3 (left), during the backward propagation, optical flow is estimated between x t+1 and x t and is utilized for spatial alignment by performing warping on the propagated features h b t+1 from the adjacent t + 1 frame.The aligned features are then passed through multiple residual blocks (reconstruction module) for refinement as commonly used in super-resolution networks [15].The equations that describe the functioning of the backward branch are the following: where OF, Warp, Rec denote optical flow estimation, warping and reconstruction modules respectively.b refers to backward propagation and h pertains to the propagated features.Subsequently, the downsampled x [t−N:t] sequence is fed into the forward propagation branch (see Fig. 3 (right)) which contains similar components as those in backward propagation.However, in the forward branch, the input sequence is utilized in reverse order.Similar set of equations applies to the forward case: where f refers to forward branch.Finally, we perform aggregation and upsampling operations to the outputs obtained from both backward and forward branches to generate a HR output, denoted ˆ︁ y t , for every x t : where Aggr and Up pertain to the Aggregation and Upsampling modules.

Preserving structural information
The adversarial loss alone does not guarantee that structural information in x t , such as the exact location and shape of surgically targeted retinal layers is preserved during this mapping.Thus, we require an additional constraint that explicitly enforces spatial and structural consistency between x t and ˆ︁ y t .
To this end, we apply a multi-layer patch-wise contrastive loss [30] that enforces that the local and global spatial structure of the input frame x t is preserved in ˆ︁ y t by maximizing the mutual information between corresponding input and generated patches.
To obtain such features from a common embedding space, both the input video x and the generated images ˆ︁ y are passed through the generator to extract features from multiple layers indexed by L = {l 0 , l 1 , l 2 , l 3 , l 4 } (denoted by black lines in Fig. 4), comprising both early highresolution layers that encode local information and deeper lower-resolution layers that produce more contextualized features.
Then, for each layer l, we randomly sample a subset of the features extracted by the generator from x [t−N:t] denoted as {u i } l .We also select their spatially corresponding counterparts {v i } l from features extracted by the generator from ˆ︁ y t and treat them as the positive samples of {u i } l .Finally, as negatives we sample a subset of the features extracted by the generator from x [t−N:t] from different spatial locations than those of the two previous sets, denoted as {n i } l .Then the contrastive loss is defined as: where To further encourage structure consistency between input and output, we use a perceptual loss [15] L Perc .Our complete objective then becomes: where λ 1 , λ 2 and λ 3 denote the weights assigned to L GAN , L Contr and L Perc losses respectively.

Implementation details
To train the recurrent VSR model, we used a strided 1×1 convolution for downsampling and the pre-trained SPyNet [31] as optical flow estimation module.For the reconstruction module, we used 10 residual blocks of 128 feature channel with instance normalization [32] and ReLU non-linearities.For aggregation, we used feature concatenation, and for upsampling pixel-shuffle [33] and hyperbolic tangent function (tanh) as last layer in the generator to produce output between [−1, 1].
The learning rate for all the modules is set to 10 −4 adopting linear decay policy after 50 epochs.Our model is trained using Adam optimizer and a batch size of 8 for a total of 200 epochs.We used NVIDIA Tesla V100 SXM3 32 GB for our experiments.
In order to calculate the multi-layer, patch-based contrastive loss, we extract features from 5 layers: the grayscale pixels, the downsampling convolution, the first convolution, the second and the fourth residual block of the reconstruction module.This approach aligns with the application of the contrastive loss, as outlined in the foundational work [30], as well as the utilization of the VGG loss, as demonstrated in [15].Both methods employ layers ranging from pixel-level up to a deep convolutional layer.We base our approach on the same rationale.
Furthermore, we choose λ 1 = 1, λ 2 = 20 and λ 3 = 1.The coefficients were experimentally determined, aligning with the methodology outlined in the original paper [30], wherein they selected λ 1 = 1 and λ 2 = 10.Through our own exploration, we identified that the application of an increased λ 2 , specifically λ 2 = 20, serves to establish higher preservation of structural information between the input and the generated output.

Experiments
In this section, we report the results derived from the quantitative analysis performed to assess the efficacy of our approach.We employed nine different no-reference image quality metrics for evaluation purposes.Additionally, we conducted ablation studies to highlight the key contributions of our methodology.

Quantitative analysis
We conducted an evaluation to assess the image quality improvement between real iOCT video frames and those generated by our iOCT video super-resolution methodology video frames.We utilized a test set comprising a total of 1738 iOCT video sequences, extracted from iOCT vitreoretinal surgery videos of 9 patients, all of whom were not included in the training set.Since aligned ground truth HR video sequences are not available, we used nine different no-reference IQA metrics to extensively assess the image quality.These metrics include Fréchet Inception Distance (FID) [34], Kernel Inception Distance (KID) [35], perceptual loss function ℓ feat [25], Global Contrast Factor (GCF) [36], Fast Noise Estimation (FNE) [37], Precision and Recall (P&R) [38] and Density and Coverage (D&C) [39].To gauge the impact of our SR method, we calculated |∆GCF|, which quantifies the absolute difference in the mean GCF values between SR images and preOCT images as well as |∆FNE|, which measures the absolute difference in the mean FNE values between SR and preOCT images.These evaluation metrics enable a comprehensive assessment of the image quality enhancement achieved by our methodology in the absence of available ground truth HR video sequences.
Table 1 summarizes the results of our quantitative analysis.Our unpaired video super-resolution approach demonstrates superior performance compared to all the other iOCT super-resolution methods, ranking first in six out of nine metrics.We conducted comparisons with the state-ofthe-art [25][26][27] and also with the single-image super-resolution model using contrastive learning (SRCUT) as proposed in the original paper [30].It is important to note that CycleGAN [16], the state-of-the-art two-sided image-to-image translation method, failed to generate images of satisfactory quality.Therefore, we excluded it from the comparisons.Our approach showcases remarkable performance in perceptual metrics such as FID, KID, ℓ feat .Lower FID values indicate lower distance and, consequently, higher similarity between the distributions of the generated SR and real preOCT images in a deep feature domain, encompassing both low-level and high-level features.Therefore, our methodology generates SR iOCT images of the highest quality as they closely resemble the images of preOCT domain which contains high levels of spatial details and diagnostic utility (see also Fig. 5 and Fig. 6 and Fig. 7).Furthermore, the KID values, which offer a similar but more advantageous than FID metric along with the ℓ feat values, indicate highest level of quality enhancement achieved by our method, concerning perceptual quality.As generated images highly resemble preOCT images, they are characterized by finer and high-frequency details avoiding noticeable blurriness and discontinuities in anatomical structures.This is attributed to the establishment of a stable adversarial training, which has shown remarkable effectiveness in domain mapping problems.Moreover, it is noteworthy that the omission of L1 and L2 loss functions from training likely contributed to the reduction of blurriness of the results as they are recognized for their potential to introduce blurriness into the generated output.Moreover, recognising that perceptual metrics may not be able to capture and assess low-level characteristics, we provide an analysis of the |∆GCF| and |∆FNE| values, where our method achieves the lowest (best) and the second lowest values, respectively.This divergence appears reasonable as noise and contrast values do not necessarily depend on each other or have to follow the same trend.Notably, the worse |∆FNE| value of our method could be attributed to the FNE's inclination to identify thin lines as noise, as stated in [37].Nevertheless, our method demonstrate improved |∆GCF| and |∆FNE| values compared to previous methods ensuring that the contrast extracted from various resolution levels (GCF) and the noise levels (FNE) in SR images closely resemble the corresponding levels in the HR preOCT images, which represent the highest quality images in our dataset.
Additionally, to address the limitations of using single-number perceptual metrics, such as FID, in separating fidelity and diversity, both significant aspects of generative models' quality, we utilize precision and recall (P&R) metrics.Our method achieves the second highest (best) values in both metrics indicating that our generated SR iOCT images highly resemble the real preOCT images (P) while covering their full variability (R).
Similarly, we incorporate Density and Coverage (D&C) metrics, an improved version of P&R, in our evaluation.Our method attains the highest D and C values among other SR models, further supporting the capability of our network to generate images with high fidelity and diversity concerning the HR domain.

Human evaluation study
To further validate our video super-resolution pipeline, we performed a human evaluation study.Our survey included 20 pairs of LR and generated through our methodology images, randomly selected from the test set.We asked 6 retinal dOCTors/surgeons to evaluate these image pairs by assigning a score between 1 (strongly disagree) and 5 (strongly agree) on the following questions: The responses, denoted as A1, A2, A3, each accompanied by their respective mean values and standard deviations, demonstrate that the clinicians recognized the improved delineation of RPE vs IS/OS junction (Q1), the absence of artificially hallucinated structures (Q2) and the enhanced delineation of ILM vs RNFL (Q3).Compared to the qualitative results reported in [27], our analysis indicates greater lever of preference and agreement among clinicians regarding important characteristics of our generated images.Across the three questions, we note a substantial elevation in the mean value exceeding 0.5 compared to the results in [27].Visual results are shown in Fig. 5 and Fig. 6, complementing the findings of our survey.

Importance of different loss terms for iOCT VSR
To study the importance of each of the 3 terms of the overall loss function (Eq.( 8)) that was used for training the proposed unsupervised video super-resolution model, we conducted a series of ablation experiments, each evaluating the effect of removing a loss term.The results are presented in Table 2 (Fig. 8).We have omitted the ℓ feat and P&R values due to spatial constraints.Instead, we have included FID and D&C values as they represent similar aspects of the evaluation.Removing the perceptual loss (w/o L Perc ) leads to a moderate drop on all metrics.Removing the (w/o L Contr , λ 3 = 1) contrastive term and keeping the perceptual term's weight in its default value (λ 3 = 1) results in significantly worse super-resolution results, as reported in Table 2.
Furthermore, removing the contrastive loss term and adjusting the weights for the perceptual loss, (w/o L Contr , λ 3 = 5 and w/o L Contr , λ 3 = 10), improves performance but still results in slightly worsen perceptual quality metrics (as measured by FID and KID).Importantly, in these cases the obtained |∆GCF| and |∆FNE| are higher indicating that the contrast and noise levels of the preOCT are better captured by the model when using our complete loss function ("ours").This observation highlights the importance of contrastive loss in maintaining consistent perceptual quality as well as contrast and noise levels.
Finally, removing the adversarial loss term (w/o L GAN ) significantly impairs performance w.r.t to all metrics, hinting that it is crucial for generating images that capture both the high and low-level aspects of the HR preOCT domain.

Importance of aggregating temporal information
In this section, we highlight the importance of using multiple images with feature propagation between neighbouring frames for iOCT super-resolution.We initially perform inference of our model while setting the output of the feature warping module (depicted as the output of blue box denoted as 'Warp' in Fig. 3) equal to zero.The rational behind this ablation is to assess whether the reconstruction module can generate high quality images without utilizing warped features from preceding frames.As reported in Table 3, in the absence of feature propagation from previous frames, the model 'w/o feat_prop' achieves in worse results on all metrics.Furthermore, to explore the effect of aggregating temporal information on the ultimate superresolution outcome, we vary the number of input frames.As shown in Table 3 and 2 nd row in Fig. 9, our model exhibits a gradual enhancement in performance with the inclusion of additional frames in the input sequence particularly evident in perceptual metrics such as FID, KID as well as the D&C metrics.However, concerning |∆GCF| and |∆FNE|, we observe that these values do not exhibit a monotonic improvement with the inclusion of more frames.This observation can be attributed to the fact that our recurrent network was trained on sequences of five frames and has learned to capture and utilize the temporal context within that specific sequence.As a result, during inference, when we reduced the sequence length, we lose some of the contextual information potentially leading to a negative impact on the network's ability to make accurate predictions.We also investigated whether our trained model can effectively handle longer image sequences (ranging from 6 to 9 frames) compared to the different training strategies as shown in Fig. 9.The FID and |∆GCF| values for each method are presented in the graphs 10(a) and 10(b), respectively.In Fig. 10(a), we observe that our method (green curve), utilizing the contrastive loss, consistently generates images of relatively similar quality as indicated by the relatively constant FID values, across different numbers of input frames.On the contrary, when trained without the constrastive loss (magenta curve), we achieved acceptable results (low FID) solely when using image sequence of a length close to the training (5) frames while higher or lower sequence lengths resulted in worse performance (high FID).Similarly, removing the perceptual loss (orange curve), leads to higher FID values degrading the perceptual quality of the generated images across time.Similarly, regarding |∆GCF|, our approach (green) maintains the contrast values at consistent levels across different frames.Analogous trend is observed in the orange curve, which was trained solely with contrastive loss.Conversely, our model trained without contrastive loss exhibited high |∆GCF| values for frames lower than 4 and higher than 7.
These findings further reinforce the efficacy or our methodology, particularly in handling longer input sequences.According to these graphs, our methodology generates consistently images of high quality particularly in handling image sequences of varying lengths while preserving image quality and contrast characteristics.The integration of contrastive loss assists the generator to learn intermediate representations that encapsulate meaningful semantic or structural information, thereby facilitating the aggregation of propagated features and the reconstruction of high-quality images.Particularly for longer input sequences, the generator is able to effectively propagate and aggregate long-term information.On the contrary, our model trained without contrastive loss, exhibits signs of overfitting, as it may have become overly reliant to using the entire 5-image context for predictions and may face challenges when that context is reduced, leading to decline in performance.
At this point, we would like to note that the selection of 5 frames is rooted in our decision to reduce the training duration in comparison to lengthier input sequences, thus affording us the opportunity to undertake an expanded number of experiments to comprehensively assess our work's efficacy.It is important to acknowledge that the adoption of extended image sequences for training purposes to capture long-term information remains a promising avenue for future investigation.

Comparison with denoising methods
In this section, we undertake a comparative analysis of our super-resolution method against conventional denoising filter in order to illustrate the denoising efficacy of our work, which forms part of the broader objective of image quality enhancement.
For evaluation, we selected two state-of-the-art speckle reduction methods for OCT images: Non-local means (NLM) [40] and Block-matching and 3D filtering (BM3D) [41].These methods have been evaluated in previous studies [8,[42][43][44] for their denoising capabilities.The NLM implementation, provided by scikit-image [45], was employed and we conducted tests with two variations: one utilizing estimated noise standard deviation (NLM ( σ)) and one without incorporating such estimation (NLM).Additionally, the BM3D approach was assessed using experimentally defined values of σ = 0.05 and σ = 0.1 as shown in the accompanying Table 4. Based on the data presented in Table 4 (see also Fig. 11), it is evident that denoised images obtained through conventional techniques do not yield images of comparable quality to those in the preOCT domain.Although these filters may indeed succeed in effectively reducing noise, they are unable to furnish valuable information regarding retinal layers and other crucial structures pertinent to vitreoretinal surgeries, a feat accomplished by our super-resolution method.

Conclusion
In this study, we proposed an unpaired video super-resolution approach for intra-operative Optical Coherence Tomography (iOCT) videos acquired during vitreoretinal surgeries.Through extensive quantitative analysis, ablation studies and a qualitative analysis by expert clinicians, we demonstrated that our approach can effectively enhance the iOCT image quality achieving a new state-of-the-art performance.Furthermore, we showed that video super-resolution models when trained with multilayer patchwise contrastive loss can effectively aggregate temporal information even in an unpaired scenario.
Potentially, our approach may be applicable in different intra-operative modalities, such as ultrasound and intra-operative MRI as well as research in 3D/4D OCT.Particularly, in the context of 3D/4D OCT, our work has the potential to surpass the outcomes reported in existing studies [43,46], as we utilize pre-operatively acquired OCT scans as HR domain to train our super-resolution network.These scans represent more than merely denoised OCT data that are used in [43,46]; rather, they have undergone comprehensive offline processing, thereby furnishing scans of high-fidelity information and diagnostic utility.Furthermore, the underlying architecture of our recurrent video super-resolution algorithm, which leverages temporal information from a sequence of images, bears the potential to increase the consistency between the generated outputs of adjacent 2D frames within the 3D OCT and thus contribute to an elevation in the overall consistency of the volumetric representation.
However, we acknowledge that our work has several limitations.First, our method exhibits slower runtime speed compared to existing works that perform single-image super-resolution.In our research, we chose to focus on the architectural innovation and training strategy to address a critical gap in the field of unpaired medical video super-resolution.To the best of our knowledge, this study is the first to address the application of video super resolution between unpaired datasets in the medical imaging domain.Moreover, the absence of pre-operative OCT images that illustrate interactions with surgical instruments, a phenomenon quite common during surgeries and iOCT scans, could potentially impact the performance of our model.Additionally, the unavailability of pixel-wise ground truth data poses constraints on our ability to incorporate spatial and temporal reference-based metrics that could further support the evaluation of our work.
In conclusion, our work can be used as a strong foundation for future research and optimization efforts.Our future work will explore improvements of our method's computational efficiency, incorporation of tool interactions within the pre-operative domain, extension of input sequences capturing long-term dependencies, utilization of additional spatial and temporal metrics and ultimately deployment of our iOCT video super-resolution approach into the clinic.

Fig. 2 .
Fig. 2. The architecture of the Bidirectional recurrent VSR network used as generator in adversarial training.The input LR iOCT sequence x [t−N:t] passes through: downsampling, optical flow, warping, reconstruction, aggregation and upsampling modules both in backward and forward direction generating HR ˆ︁ y [t−N,t] output.

Fig. 4 .
Fig. 4. Multilayer Patchwise Contrastive Loss using part of the video super-resolution generator as encoder.Both x and ˆ︁ y t are encoded in features.l 0 , l 1 , l 2 , l 3 , l 4 are the multiple layers that we extracted features from.The features of a generated output patch should be closer to the features of the corresponding patch in the input image (associating) compared to random patches from different locations (dissociating).

Fig. 6 .
Fig.6.Visual comparison of different SR methods.From top to bottom: LR iOCT images, SR using our UVSR approach, SRCUT and SR using[26].

Fig. 7 .
Fig. 7. Visual results of our proposed method in challenging scenarios encountered during vitreoretinal surgeries.(a) Macular hole: Our method effectively preserves the original structural elements and maintains the interaction between the retina and surgical tools as seen in the input.(b) ERM Peeling: The output distinctly illustrates the presence of epiretinal membrane (ERM) in unpeeled retinal areas, while accurately depicting the absence of ERM in the peeled regions.Remarkably, the model achieves an acceptable level of generalization, even when faced with inputs that lie outside the usual realm of its training and testing data.

Q1:
Can you notice an improvement in the delineation of RPE/Bruchs vs. IS/OS junction in the generated image?(A1: 4.58±0.29)Q2: Do you agree that the generated image does NOT contain artificially hallucinated structures?(A2: 4.52±0.05)Q3: Can you notice an improvement in the delineation of the ILM vs. RNFL in the generated image?(A3: 4.38±0.22)

Fig. 8 .
Fig. 8. Visual comparison between different training strategies.From left to right: LR iOCT images, ours (using both L Contr and L Perc ), ours w/o L Perc and ours w/o L Contr (λ3 = 5).

Fig. 9 .
Fig. 9. Visual comparison between different training strategies across varying input sequence lengths.From left to right: different timestamps of the input sequence.From top to bottom: LR iOCT images, ours (using both L Contr and L Perc ), ours w/o L Contr (λ3 = 5) and ours w/o L Perc .

Fig. 10 .
Fig. 10.Ablation on temporal information: Graphs illustrating the performance of our methodology with contrastive loss (green curve), without contrastive loss (w/o L Contr (λ3 = 5)) (magenta curve) and without perceptual loss (w/o L Perc ) (orange curve) across varying input sequence lengths in terms of image quality metrics.