Reducing image variability across OCT devices with unsupervised unpaired learning for improved segmentation of the retina.

Diagnosis and treatment in ophthalmology depend on modern retinal imaging by optical coherence tomography (OCT). The recent staggering results of machine learning in medical imaging have inspired the development of automated segmentation methods to identify and quantify pathological features in OCT scans. These models need to be sensitive to image features defining patterns of interest, while remaining robust to differences in imaging protocols. A dominant factor behind such image differences is the type of OCT acquisition device. In this paper, we analyze the ability of recently developed unsupervised unpaired image translation methods based on cycle-consistency losses (cycleGANs) to deal with image variability across different OCT devices (Spectralis and Cirrus). This evaluation was performed on two clinically relevant segmentation tasks in retinal OCT imaging: fluid and photoreceptor layer segmentation. Additionally, a visual Turing test designed to assess the quality of the learned translation models was carried out by a group of 18 participants with different background expertise. Results show that the learned translation models improve the generalization ability of segmentation models to other OCT vendors/domains not seen during training. Moreover, we identified relationships between model hyper-parameters and both the realism and the morphological consistency of the generated images.


Introduction
Optical coherence tomography (OCT) is a non-invasive technique that provides 3D volumes of the retina at micrometric resolution [1]. Each OCT volume comprises multiple cross-sectional 2D images, or B-scans, each of them composed of 1D columns, or A-scans. The OCT imaging modality allows clinicians to perform detailed ophthalmic examinations for disease diagnosis, assessment and treatment planning. Standard treatment and diagnosis protocols nowadays rely heavily on B-scan images to inform clinical decisions [2].
Several imaging tasks such as segmentation of anatomical structures or classification of pathological cases are successfully addressed by automated image analysis methods. These approaches are commonly based on machine learning (ML) models, which are trained on manually annotated datasets in a supervised setting. Among the existing ML tools, deep learning (DL) techniques based on convolutional neural networks have been remarkably successful in different domains [3], including automated OCT image analysis [4].
However, these models are usually prone to errors when deployed in real clinical scenarios. This is partially due to differences between the data distributions of the training sets and the real-world data sets. This phenomenon is known as covariate shift, as formally defined in [5]. An important example of this phenomenon is observed in practice when the training and deployment data are acquired with different OCT devices. In this work, we apply unsupervised unpaired translation algorithms based on cycleGANs to reduce this variability, and we additionally conduct a visual Turing test with participants of different background expertise to comprehensively study the appearance of the resulting synthetic images. To the best of our knowledge, this is the first work that comprehensively explores cycleGAN algorithms in OCT images to reduce image variability across OCT acquisition devices. The main contributions of this paper can be summarized in the following three points: • We empirically demonstrate that unsupervised unpaired translation algorithms are able to reduce the covariate shift between different OCT acquisition devices. Segmentation performance in the particular cases of retinal layer and fluid segmentation showed significant quantitative improvements. These tasks are clinically relevant, and possibly the most active research topics in automated OCT image segmentation.
• We performed an extensive analysis of the effect of the training patch size of the unsupervised unpaired algorithm on the cross-vendor segmentation tasks. In particular, we found that the optimal patch size depends on the task, with larger patch sizes more suitable for tasks in which larger structures need to be segmented.
• We performed a visual Turing test with a large set of participants (n=18) with different background expertise. The results show that the image patch size at the training stage is an important parameter, controlling the trade-off between the "realism" and the "morphological fidelity" of the images generated by the translation model. The test demonstrated that translation models trained with larger patch sizes induced a more realistic appearance, but were prone to generating artifacts that distort morphological features in the retina. On the other hand, models trained with smaller patch sizes rarely presented such distortions, but were easily identified as "fake" images.
This paper is organized as follows. First, Section 2 describes the methods used in this work: the deep learning segmentation models, the cycleGAN and the baseline translation algorithms. In Section 3 we describe the datasets used, the training details of the models and the experimental evaluation setup. In Section 4, qualitative and quantitative results of the segmentation tasks and the visual Turing test are given. Finally, a discussion of these results and the conclusions of the work are presented in Sections 5 and 6, respectively.

Methods
This section presents a summary of the methodologies used in our study. In particular, Section 2.1 describes the unsupervised unpaired translation algorithm based on cycleGANs. Subsequently, Section 2.2 presents a brief description of alternative filtering-based translations that are used as baselines to compare with our proposed approach. Finally, Section 2.3 formalizes the segmentation tasks and describes the neural network architectures used for fluid and photoreceptor layer segmentation.

Unsupervised unpaired translation using cycleGANs
We propose to apply unsupervised unpaired generative adversarial networks to automatically translate OCT images from one vendor to another. Formally, let A and B be two different image domains, where any image a ∈ A has different visual characteristics compared to any image b ∈ B. Cycle generative adversarial networks (cycleGANs) make it possible to learn a suitable translation function between the image domains A and B without requiring paired samples, thereby tackling an ill-posed translation problem (i.e. unpaired translation) [21]. A cycleGAN uses two generator/discriminator pairs (G A→B /D B , G B→A /D A ), which are implemented as deep neural networks. G A→B is supplied with an image from the source domain A and translates it into the target domain B. Analogously, G B→A translates the image back from the target domain B to the source domain A. D A (D B ) is trained to distinguish real samples of domain A (B) from translated images, and its output is associated with the likelihood that a given image was sampled from domain A (B). The objective for the mapping functions can be expressed as:

L(G A→B , G B→A , D A , D B ) = L GAN (G A→B , D B ) + L GAN (G B→A , D A ) + η 1 L cyc + η 2 L identity ,    (1)

where the last two terms are the cycle consistency loss L cyc and the identity mapping loss L identity , as defined in [21], with weights η 1 and η 2 that control their relevance in the overall loss. Both L cyc and L identity serve as useful regularization terms that improve the obtained translation functions G A→B and G B→A . Minimizing L cyc constrains the translations to be reversible, so that successively applying G A→B and G B→A (G B→A and G A→B ) to an image from domain A (B) generates an image that matches the original source image. In turn, minimizing L identity regularizes the resulting translation G A→B (G B→A ) to generate a target image that is close to the source image if the source image already has a target B (A) domain appearance. Intuitively speaking, this means that the translations are constrained to avoid unnecessary changes to the source images.
The first two terms L GAN correspond to the least-squares generative adversarial loss terms [25]:

L GAN (G A→B , D B ) = E b∼B [(D B (b) − 1)²] + E a∼A [D B (G A→B (a))²],    (2)

where L GAN (G B→A , D A ) is defined analogously. Both the generators and discriminators were implemented as deep neural networks following the ResNet-based architecture presented in [21].
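As a minimal numerical illustration, the combined objective above can be sketched with NumPy stand-ins. The generators, discriminators, weights η 1 = 10 and η 2 = 5, and toy arrays below are illustrative assumptions, not the actual networks or hyper-parameters:

```python
import numpy as np

def lsgan_loss(d, real, fake):
    # Least-squares adversarial term: D should output 1 on real images
    # and 0 on translated ("fake") images.
    return np.mean((d(real) - 1.0) ** 2) + np.mean(d(fake) ** 2)

def cycle_loss(g_ab, g_ba, a, b):
    # L_cyc: translating forth and back must recover the source image.
    return np.mean(np.abs(g_ba(g_ab(a)) - a)) + np.mean(np.abs(g_ab(g_ba(b)) - b))

def identity_loss(g_ab, g_ba, a, b):
    # L_identity: an image that already looks like the target domain
    # should pass through the translation almost unchanged.
    return np.mean(np.abs(g_ab(b) - b)) + np.mean(np.abs(g_ba(a) - a))

def full_objective(g_ab, g_ba, d_a, d_b, a, b, eta1=10.0, eta2=5.0):
    # Sum of two adversarial terms plus weighted cycle/identity terms.
    return (lsgan_loss(d_b, b, g_ab(a)) + lsgan_loss(d_a, a, g_ba(b))
            + eta1 * cycle_loss(g_ab, g_ba, a, b)
            + eta2 * identity_loss(g_ab, g_ba, a, b))

# Sanity check: with identity "generators" the cycle and identity terms
# vanish, and a constant "always real" discriminator yields only the
# fake-image adversarial penalty.
a = np.random.rand(4, 64, 64)
b = np.random.rand(4, 64, 64)
ident = lambda x: x
d_one = lambda x: np.ones(len(x))
loss = full_objective(ident, ident, d_one, d_one, a, b)
```

In a real training run the two generator/discriminator pairs are optimized adversarially; this sketch only shows how the loss terms compose.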

Baseline translation algorithms
Domain translation across images of different acquisition devices is not a common operation in OCT imaging. As such, no baselines are currently available for comparison purposes. However, some preprocessing pipelines including denoising and histogram matching algorithms have been used to standardize the appearance of OCT B-scans [26]. In this work, we followed the methodology applied in [27] to define two suitable translations that approximate the appearance of Spectralis OCTs from Cirrus B-scans. The first translation strategy (T 1 ) consists of an initial B-scan level median filtering operation (with a 3 × 3 kernel size) followed by a second median filtering operation across neighboring B-scans, with a 1 × 1 × 3 kernel size. The second translation (T 2 ) consists of an initial histogram matching step using a random Spectralis OCT volume as a template, followed by the same filtering operations as described for T 1 . These strategies were applied as baseline translations when converting images from the Cirrus to the Spectralis domain. Translations from Spectralis to Cirrus are not commonly employed in the literature, since Cirrus B-scans have a lower signal-to-noise ratio (SNR).
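The two baseline translations can be sketched as below. The axis ordering (n_bscans, height, width), the helper names, and the quantile-based histogram matching routine (a stand-in for a library implementation) are our own assumptions; the paper only specifies the kernel sizes:

```python
import numpy as np
from scipy.ndimage import median_filter

def t1(volume):
    # T1: 3x3 median filter within each B-scan, then a median across
    # 3 neighboring B-scans (the 1x1x3 kernel, here along axis 0).
    smoothed = median_filter(volume, size=(1, 3, 3))
    return median_filter(smoothed, size=(3, 1, 1))

def match_histogram(source, template):
    # Quantile-based histogram matching: map each source intensity onto
    # the template's empirical intensity distribution.
    s_vals, s_idx, s_counts = np.unique(
        source.ravel(), return_inverse=True, return_counts=True)
    t_vals, t_counts = np.unique(template.ravel(), return_counts=True)
    s_q = np.cumsum(s_counts) / source.size
    t_q = np.cumsum(t_counts) / template.size
    mapped = np.interp(s_q, t_q, t_vals)
    return mapped[s_idx].reshape(source.shape)

def t2(volume, template_volume):
    # T2: histogram matching to a (randomly chosen) Spectralis template,
    # followed by the same filtering as T1.
    return t1(match_histogram(volume, template_volume))

# Toy volumes in (n_bscans, h, w) layout with arbitrary intensity units.
cirrus = np.random.default_rng(0).random((5, 16, 16))
spectralis_template = np.random.default_rng(1).random((5, 16, 16)) + 1.0
translated = t2(cirrus, spectralis_template)
```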

Segmentation models
A segmentation model f θ with parameters θ aims to output a label y ∈ {1, . . . , k} for each pixel x in an input image I ∈ R h×w (i.e. a B-scan), with k being the number of classes and h and w the height and width of the images in pixels. In general, f is modeled using convolutional neural networks. In that case, the parameters θ are learned in a supervised way from a training set S = {(I (i) , Y (i) )}, 1 ≤ i ≤ n, with n pairs of training images I (i) and their corresponding manual annotations Y (i) . This is done by minimizing a pixel-wise loss function J(f θ (I (i) ), Y (i) ) that penalizes differences between the prediction f θ (I (i) ) and its associated ground-truth labeling Y (i) .
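For concreteness, the pixel-wise loss J can be sketched as follows. This is a minimal NumPy version assuming the network output is already a per-pixel probability map of shape (k, h, w); the function name is our own:

```python
import numpy as np

def pixelwise_nll(probs, labels, eps=1e-12):
    # Negative log-likelihood of the ground-truth class at each pixel,
    # averaged over the image. `probs` has shape (k, h, w) and `labels`
    # holds integer class indices of shape (h, w).
    picked = np.take_along_axis(probs, labels[None, :, :], axis=0)[0]
    return float(-np.mean(np.log(picked + eps)))
```

A perfect prediction drives the loss towards zero, while a uniform prediction over k classes yields log k per pixel.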
In the fluid and photoreceptor segmentation tasks analyzed in this paper, f is a fully convolutional neural network with an encoder-decoder architecture, inspired by the U-Net [28] architecture. The encoder path consists of convolution blocks followed by max-pooling layers, which contracts the input and uses the context for segmentation. The decoder counterpart, on the other hand, performs up-sampling operations followed by convolution blocks, enabling precise localization in combination with skip-connections. For fluid segmentation, a standard U-Net architecture was applied. For photoreceptor segmentation, we took advantage of the recently proposed U2-Net [29], which is to the best of our knowledge the only existing deep learning approach for this task.

Experimental setup
We evaluated the ability of cycleGANs to translate OCT images from one vendor to another in two ways. First, we estimated the generalization ability of models for fluid (i.e. pathology) and photoreceptor layer (i.e. regular anatomy) segmentation on a new, unseen domain when translating the input images to resemble those used for training. Second, we performed a visual Turing test in which different expert groups analyzed the images produced by the cycleGAN. We additionally asked the experts to identify morphological changes introduced by the cycleGAN-based translations.
The datasets (Section 3.1), deep learning training setups (Section 3.2) and the evaluation of the translation models (Section 3.3) are described next.

Materials
All the OCT volumes used in our experiments were acquired either with Cirrus HD-OCT 400/4000 (Carl Zeiss Meditec, Dublin, CA, USA) or Spectralis (Heidelberg Engineering, GER) devices. Spectralis (Cirrus) devices utilize a scanning superluminescent diode emitting a light beam with a center wavelength of 870nm (840nm), an optical axial resolution of 7µm (5µm) in tissue, and an optical transversal resolution of 14µm (10µm) in tissue. All the scans were centered at the fovea and covered approximately the same physical volume of 2mm × 6mm × 6mm. Cirrus images had a voxel dimension of 1024 × 512 × 128 or 1024 × 200 × 200, and Spectralis volumes had a voxel dimension of 496 × 512 × 49. All Cirrus volumes were resampled using nearest-neighbor interpolation to match the resolution of Spectralis volumes (496 × 512 × 49). As observed in Fig. 2, the B-scans produced by each device differ substantially from each other. In addition to the differences mentioned earlier with respect to the scanning light source and the axial and transversal OCT resolution, the Spectralis devices apply a B-scan averaging procedure (by default, 16 frames per B-scan). This procedure leads to an improved signal-to-noise ratio (SNR) compared to Cirrus scans. Image data was anonymized and ethics approval for research use was obtained from the ethics committee of the Medical University of Vienna (Vienna, Austria).
Photoreceptor Layer Segmentation Dataset Photoreceptor segmentation experiments were performed using two data sets of Cirrus and Spectralis scans, comprising 43 and 50 volumes, respectively. Each Cirrus (Spectralis) volume is composed of 128 (49) B-scans with a resolution of 496 × 512 pixels. All the volumes were acquired from diseased patients. In particular, the Spectralis subset comprised 16 images with DME, 24 with RVO and 10 with intermediate AMD.
The distribution of diseases in the Cirrus subset was approximately uniform, with 16 DME, 17 RVO and 10 early AMD cases. Each volume was manually delineated by trained readers and supervised by a retina expert, who corrected the segmentations when needed. The datasets were randomly divided on a patient basis into 1519 (1323), 196 (147) and 739 (637) B-scans used for training, validation and testing, respectively, in the Spectralis (Cirrus) set.
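The nearest-neighbor resampling of the Cirrus volumes can be sketched as follows. The axis ordering (depth, width, n_bscans) is an assumption chosen to match the stated voxel dimensions, and the helper name is our own:

```python
import numpy as np
from scipy.ndimage import zoom

def resample_to(volume, target_shape=(496, 512, 49)):
    # Nearest-neighbor (order=0) resampling so that no new intensity
    # values are introduced, only existing voxels are replicated/dropped.
    factors = [t / s for t, s in zip(target_shape, volume.shape)]
    return zoom(volume, factors, order=0)
```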
Visual Turing Test Dataset For the visual Turing test, we randomly selected a subset of B-scans from the fluid segmentation test set. In particular, B-scans were sampled from randomly selected OCT volumes following a Gaussian distribution, having a mean on the central B-scan (#25) and a standard deviation of 8 B-scans. A total of 90 Spectralis and 90 Cirrus B-scans were selected.
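The B-scan sampling scheme can be sketched as below. Rounding draws to integer indices and clipping to the valid range are our own assumptions; the paper does not specify how out-of-range draws were handled:

```python
import numpy as np

def sample_bscan_indices(n_samples, n_bscans=49, center=25, std=8, seed=None):
    # Draw B-scan indices from a Gaussian centered on the central
    # B-scan (#25) with a standard deviation of 8 B-scans, then clip
    # to the valid index range of the volume.
    rng = np.random.default_rng(seed)
    idx = np.rint(rng.normal(center, std, size=n_samples)).astype(int)
    return np.clip(idx, 0, n_bscans - 1)
```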

Training setup
The network architectures, training setups and configurations of each deep learning model are summarized in the sequel for each stage of our evaluation pipeline.
Unsupervised Image Translation Model Four different cycleGAN models were trained on the unpaired training dataset using square patches of size 64 × 64, 128 × 128, 256 × 256, and 460 × 460, resulting in four different models (CycleGAN64, CycleGAN128, CycleGAN256, CycleGAN460). Each training phase consisted of 20 epochs using a mini-batch size of 1. At each epoch, the deep learning model processes a patch randomly extracted from each B-scan in the training set. Each generator/discriminator pair was saved after each epoch in order to subsequently select the best performing model. To address the fact that adversarial losses are unstable during training, we applied the following model selection strategy based on the validation set. First, all 20 generators were used to translate the corresponding (Cirrus or Spectralis) central B-scans of the validation set. Then, the average L GAN (D) term in Eq. 2 was computed for all associated discriminators using the same image sampling order for the target and source image sets. The maximum adversarial loss over the 20 discriminators was then used as the final selection score for each generator. Finally, the generator with the minimum score was selected for each patch size configuration. This procedure allowed us to select pairs of generators that were not necessarily paired at the training stage.
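The checkpoint selection heuristic described above can be sketched as follows. The generators and discriminators are stand-in callables, and the simplified least-squares score computed on the translated images only is an illustrative assumption (the full discriminator term also includes real samples):

```python
import numpy as np

def select_generator(generators, discriminators, val_images):
    # Min-max selection: score each saved generator by the *maximum*
    # adversarial loss that any saved discriminator assigns to its
    # translated validation images, then keep the generator with the
    # minimum score.
    scores = []
    for g in generators:
        fakes = [g(x) for x in val_images]
        # Least-squares adversarial term: a convincing fake drives the
        # discriminator output towards 1, hence a low (d(f) - 1)^2.
        d_losses = [float(np.mean([(d(f) - 1.0) ** 2 for f in fakes]))
                    for d in discriminators]
        scores.append(max(d_losses))
    return int(np.argmin(scores)), scores
```

With toy callables, a generator whose outputs a critic rates as "real" (output near 1) wins the selection.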
Fluid Segmentation Model The fluid segmentation model is an encoder/decoder network inspired by the U-Net architecture. We used five levels of depth, with the number of output channels going from 64 in the first level to 1024 in the bottleneck layer, in powers of 2. Each convolutional block consisted of two 3 × 3 convolutions, each followed by a batch-normalization layer and a rectified linear unit (ReLU). While 2 × 2 max-pooling was used for downsampling, upsampling was performed using nearest-neighbor interpolation.
We used the negative log-likelihood loss in all our segmentation experiments, Kaiming initialization [30], Adam optimization [31], and a learning rate of 1e-3, which was halved every 15 epochs. We trained our networks for 80 epochs and selected the model with the best average F1-score on the validation set.
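The step learning-rate schedule can be expressed compactly (a sketch; `lr_at_epoch` is our own helper name):

```python
def lr_at_epoch(epoch, base_lr=1e-3, step=15, gamma=0.5):
    # Start at base_lr and multiply by gamma every `step` epochs,
    # i.e. halve the learning rate every 15 epochs of the 80-epoch run.
    return base_lr * gamma ** (epoch // step)
```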
Photoreceptor Layer Segmentation Model Photoreceptor segmentation was performed by means of the U2-Net approach described in [29]. This architecture retrieves probabilistic segmentations of the region of interest together with uncertainty maps highlighting potential areas of pathological morphology and/or errors in the prediction. The core model is inspired by the U-Net, while incorporating dropout with a rate of 0.2 after several convolutional layers. By keeping dropout active at test time, T = 10 Monte Carlo samples were obtained, and the final segmentation was retrieved as the pixel-wise average of the resulting samples.
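The Monte Carlo dropout inference step can be sketched as below. `model` is a stand-in stochastic callable (not the actual U2-Net), and returning the pixel-wise variance as an uncertainty map is a common convention we assume here:

```python
import numpy as np

def mc_dropout_predict(model, image, t=10, seed=None):
    # Keep dropout active at test time, draw t stochastic forward
    # passes, and average them pixel-wise; the per-pixel variance
    # serves as an uncertainty map.
    rng = np.random.default_rng(seed)
    samples = np.stack([model(image, rng) for _ in range(t)])
    return samples.mean(axis=0), samples.var(axis=0)
```

For a deterministic model the samples coincide, so the mean reproduces the single prediction and the variance (uncertainty) is zero.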

Evaluation of the translation model
We evaluated the quality of the translated images obtained by the cycleGAN translation models in two different scenarios. On one hand, we measured the ability of the cycleGAN algorithm to reduce the covariate shift between images from different OCT vendors in automated retinal segmentation tasks. On the other hand, we carried out a visual Turing test to evaluate both the "realism" of the generated images and to identify potential morphological artifacts introduced in the translated version of the scans.
Evaluation via Segmentation Tasks The performance of the segmentation models was assessed on several versions of the B-scans in the corresponding test set. These B-scan versions were obtained by applying one of the following processes: (1) no translation; translation with one of four cycleGAN models trained with different image patch sizes, namely (2) CycleGAN64, (3) CycleGAN128, (4) CycleGAN256 and (5) CycleGAN460; or the (6) T1 and (7) T2 translations, applied only as baselines for the Cirrus-to-Spectralis direction (Section 2.2). We evaluated the performance of the segmentation models for the different image versions by computing the Dice score, precision and recall on the test set at the voxel level. One-sided Wilcoxon signed-rank tests were performed to test for statistically significant differences.
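The voxel-level metrics can be sketched for a single binary class as follows (boolean masks assumed; the small `eps` guards against division by zero for empty masks and is our own addition):

```python
import numpy as np

def dice_precision_recall(pred, gt, eps=1e-8):
    # Voxel-level Dice, precision and recall for one binary class,
    # computed from true/false positive/negative counts.
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    dice = 2 * tp / (2 * tp + fp + fn + eps)
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    return float(dice), float(precision), float(recall)
```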
Evaluation via Visual Turing Test The perceptual evaluation was carried out by a group of 18 participants with different professional backgrounds (6 computer scientists, 6 OCT readers and 6 ophthalmologists), all of them experienced in working with OCT images. The graphical user interface was implemented using the jsPsych library [32]. In the first task, the participants were asked to identify the "fake" translated image in a shown pair of original/translated B-scans. For this part of the visual Turing test, one-sided Wilcoxon signed-rank tests were performed to check for statistically significant differences.
In the second task, the participants were asked to identify morphological changes in the retina introduced by the cycleGAN-based translations. The evaluation resulted in a morphology preservation score (MPS) for each original/translated image pair, ranging from 1 (morphological changes detected) to 5 (no morphological changes detected). In particular, the participants were asked to rate their level of agreement (from 1 'strong disagreement' to 5 'strong agreement') with the statement "There are no morphological differences between the original and the translated image", meaning that MPS ∈ {1, 2, 3, 4, 5}. If the MPS was 1 or 2, the participants were further asked to select the portions of the image in which they found differences. This second task was carried out only by the ophthalmologist group.
Each participant observed 60 translated/original B-scan pairs (30 original Spectralis and 30 original Cirrus B-scans), randomly sampled from the Visual Turing Test Dataset (Section 3.1). For each sample, either the CycleGAN128 or the CycleGAN460 model was randomly chosen to generate the translated version of the image. For the second task, one-sided Mann-Whitney U tests were performed to determine statistical significance.

Results
While general qualitative results for the Cirrus-to-Spectralis translation are illustrated in Fig. 3, qualitative segmentation results are shown in Figs. 4-5 for both translation directions. Retinal fluid segmentation results are presented in Section 4.1 (Figs. 6-7) and results for retinal photoreceptor layer segmentation are provided in Section 4.2 (Fig. 8). Results of the visual Turing test are covered in Section 4.3 (Figs. 9-10). The upper-bound performance of the native Cirrus models was lower than that of their Spectralis counterparts, with the larger difference for the SRF class.

Fluid segmentation results
For the IRC class, the cross-vendor evaluation of both segmentation models (Spectralis and Cirrus) on the non-translated datasets showed a clear performance drop (Fig. 6), especially on the Cirrus test set (Fig. 6(a-b)). There, all applied translation strategies significantly improved the cross-vendor segmentation performance with respect to the 'no translation' scenario (p << 0.05). The approach with the highest Dice was the cycleGAN model trained with image patches of 256 × 256 (CycleGAN256). This model performed significantly better than both the T 1 and T 2 baseline approaches (p < 0.05). Finally, the tests did not yield a significant difference between the best translation model (CycleGAN460) and the 'no translation' scenario in the Spectralis test set (Fig. 6(c-d)). We also observed a performance drop in the cross-vendor evaluation of the SRF class when no translation was applied, with the drop being more prominent in the Cirrus test set (Fig. 7(a-b)). There, all the translation models showed an improvement with respect to the scenario without translation (p << 0.05). The best performing cycleGAN model was trained with image patches of 256 × 256 (CycleGAN256), showing a significantly better performance than the T1 and T2 baseline translation algorithms (p < 0.05). Notably, the CycleGAN256 model performed on a par with the upper-bound Cirrus-on-Cirrus model (p = 0.37). In the Spectralis test set (Fig. 7(c-d)), the CycleGAN256 model achieved a higher mean Dice and a lower variance compared to the 'no translation' scenario. However, a significant difference between the distributions was not found (p = 0.18).

Photoreceptor layer segmentation results
Results for the photoreceptor layer segmentation task are summarized in Fig. 8. The upper-bound models obtained a precision, recall and Dice in the Cirrus (Spectralis) test set of 0.88 (0.89), 0.91 (0.89) and 0.89 (0.90), respectively. A drop in Dice was observed for both segmentation models when no translation was used, in both translation directions. The decrease was larger in the Cirrus test set (from 0.90 to 0.56). Moreover, the cross-vendor evaluation showed that all applied translation strategies significantly outperformed the "no translation" scenario in the Cirrus test set (p << 0.05, Fig. 8(a-b)). The best result in terms of Dice was obtained by CycleGAN128, showing a significantly better performance than T1 and T2 (p << 0.05). In the Spectralis test set (Fig. 8(c-d)), the CycleGAN64-based translation achieved the highest Dice (0.88), significantly outperforming the "no translation" scenario (p < 0.05).

Visual Turing test results
The results of the first visual Turing task involving an original/translated image pair (see Section 3.3) are summarized in Fig. 9, showing the percentage of identified "fake" Spectralis B-scans for the Cirrus-to-Spectralis translation in Fig. 9(a-b) and the percentage of identified "fake" Cirrus B-scans for the Spectralis-to-Cirrus translation in Fig. 9(c-d). Note that a value of 0.5 would correspond to a scenario in which the participants randomly select one of the B-scans as fake, meaning that the transformed images could not be distinguished from the original scans.
When evaluating the median proportion of identified "fake" images for the (CycleGAN128 / CycleGAN460) models across all participants, a difference was found in both the Cirrus-to-Spectralis (0.97 / 0.73) and the Spectralis-to-Cirrus (0.83 / 0.8) translation directions. This means that the images generated by the CycleGAN460 model were harder to identify as fake than the images generated by the CycleGAN128 model. A paired one-sided Wilcoxon signed-rank test showed that this difference was significant in both directions (p << 0.05 in the Cirrus-to-Spectralis and p < 0.05 in the Spectralis-to-Cirrus direction). This is also reflected in Fig. 9(b), where for each expert group the "fake" Spectralis B-scans generated by the CycleGAN460 model were harder to identify than those generated by the CycleGAN128 model. However, this effect was less pronounced in the other translation direction (Spectralis-to-Cirrus, Fig. 9(d)). Finally, we can observe that the ophthalmologists showed a lower variance compared to the other expert groups.
Quantitative results of the second visual Turing task (Fig. 10) show the distribution of the MPS for both the CycleGAN128 and CycleGAN460 models and for both translation directions. For the Cirrus-to-Spectralis translation, the median MPS was significantly higher for CycleGAN128 (median MPS=5) than for CycleGAN460 (median MPS=4), with p << 0.05. The same trend can be observed for the Spectralis-to-Cirrus translation (p < 0.05), although the median MPS was the same for both cycleGAN models (median MPS=5). In summary, these results clearly indicate that the CycleGAN128 model introduced fewer morphological changes during the translation than the CycleGAN460 model, for both translation directions. Exemplary qualitative results for introduced morphological changes are shown in Fig. 11.

Discussion
The main hypothesis of our work was that unsupervised cycleGAN-based translation algorithms can significantly reduce the covariate shift phenomenon affecting automated retinal segmentation models across different OCT acquisition devices. Our results show that these translation approaches indeed allowed deep learning models to improve their generalization ability on cross-vendor OCT images unseen at the training stage. In all the evaluated segmentation tasks, the effect of the covariate shift was larger when applying Spectralis models on the Cirrus test set than when applying Cirrus models on the Spectralis test set, without any translation (Figs. 6(a-b), 7(a-b) and 8(a-b)). A possible explanation is that the Spectralis segmentation model could not deal with the lower SNR in Cirrus scans, e.g. because the noise makes fluid regions appear much brighter than in Spectralis images. Cirrus models seemed to be more resilient against the covariate shift when applied on Spectralis scans, showing a smaller but still significant performance drop compared to the upper bound (Figs. 6(c-d), 7(c-d) and 8(c-d)).
In most of the segmentation tasks and translation directions (Cirrus-to-Spectralis or Spectralis-to-Cirrus), we found that the cycleGAN-based translation models trained with an image patch size of 256 × 256 improved the performance of the segmentation models, significantly or slightly depending on the translation direction and segmentation task, with respect to a scenario without any translation. However, each task and translation direction had a different optimal training image patch size. In the fluid segmentation task, in which the objective is to identify hypo-reflective (dark) regions, the best performance was obtained with larger image patch sizes (256 × 256, 460 × 460). In contrast, for the photoreceptor layer segmentation task, in which the aim is to detect a thin layered region, the best performance was obtained with smaller image patch sizes (64 × 64, 128 × 128). One factor explaining these results may be the size of the structures to be segmented. Larger structures seem to require larger training image patches for the cycleGAN to capture the needed contextual information. Thus, the segmentation models may focus more on the global appearance (context) when segmenting larger structures, meaning that in this case it may be more important to reduce the covariate shift on a "global appearance level" rather than on a local one. Additionally, using larger patches at the training stage allowed the translation models to generate more realistic images at test time. However, those models were also more likely to introduce image artifacts in the translated B-scans. This was empirically demonstrated in the visual Turing test, as discussed in the next paragraphs.
The first part of the visual Turing test was conducted to evaluate the realism of the generated B-scans. It was inspired by [33], but our set-up constituted a more stringent assessment, since it not only required the participants to judge the "realism" of the generated images but also enabled a direct comparison with the original image. The main finding of this test was that using larger image patch sizes during training of the cycleGAN-based models resulted in more "realistic" translated B-scans (Section 3.3). The results showed that the "fake" images generated by CycleGAN460 were harder to identify in both the Spectralis-to-Cirrus and Cirrus-to-Spectralis directions (Figs. 9(a,c)). This may be related to the above-mentioned hypothesis that larger patches during training allow learning more complex/realistic translations and reduce the covariate shift also on a "global appearance level". When observing the percentage of identified "fake" B-scans stratified by expertise background (Figs. 9(b,d)), we found that the ophthalmologist group performed consistently in identifying "fake" images generated by the CycleGAN128. This result might indicate that their knowledge of retinal structures and B-scan appearance in general helped them to robustly identify "fake" images. However, this advantage was no longer evident for the CycleGAN460 model. Some of the cues reported by the ophthalmologists and OCT readers for identifying fake B-scans were the quality of the choroid tissue, the vitreous border and the smoothness of retinal layer borders. Conversely, participants of the computer scientist group reported relying more on visual cues based on differences in the pre-processing operations used by the OCT acquisition devices (e.g. quality of the B-scan filtering or B-scan tilt correction).
The results of the second part of the visual Turing test illustrate that translation models trained with smaller image patch sizes introduce fewer retinal morphological differences in the translated images. An explanation of this phenomenon might be that training translation models with a smaller patch size results in more conservative (simpler) translations that are less likely to introduce artifacts (but also less realistic). In particular, the MPS distribution for the CycleGAN460 models (in both directions) had a significantly larger number of decisions rating the image pairs as having strong morphological differences (scores 1-3) than that observed for the CycleGAN128 model. This indicates that models trained with larger patch sizes are prone to inducing artifacts in the generated images. Additionally, the amount of perceived retinal anatomical differences also depended on the direction of the translation. When translating from Cirrus to Spectralis, a slightly larger number of morphological differences was identified than in the B-scans translated from Spectralis to Cirrus.
We conjecture that translating from a low-SNR to a high-SNR B-scan required the translation models to "invent" information that was not present in the source image, meaning that the lower the signal-to-noise ratio the more likely it is that image artifacts are introduced. Moreover, we empirically observed that the quality of the input images was also a factor for the quality of the translated images. For instance, an extremely low contrast B-scan from any of the OCT vendors would likely generate a low quality B-scan after the translation. An example of such a case is presented in the first row of Fig. 11. We also found qualitatively that a substantial amount of regions highlighted by the ophthalmologists as artifacts were related to the border of the layers delimiting the retina. For instance, changes in the appearance of the bottom layers were observed in the two lower rows of Fig. 11. In the second row, the generated Cirrus B-scan attenuates/removes one of the layers observed in the original Spectralis B-scan. In the third row, the appearance of the retinal pigment epithelial (RPE) layer seemed to be altered in the translated Spectralis B-scan compared to the Cirrus original counterpart.

Conclusion
In this work, we presented an unsupervised unpaired learning strategy using cycleGANs to reduce image variability across OCT acquisition devices. The method was extensively evaluated on multiple retinal OCT image segmentation tasks (IRC, SRF, photoreceptor layer) and by means of visual Turing tests. The results show that the translation algorithms improve the performance of segmentation models on target datasets coming from a different vendor than the training set, thus effectively reducing the covariate shift (the difference between the target and source domains). This demonstrates the potential of the presented approach to overcome a key limitation of existing methods, whose applicability is usually limited to samples that match the training data distribution. In other words, the proposed translation strategy improves the generalizability of segmentation models in OCT imaging. Since automated segmentation methods are expected to become part of routine diagnostic workflows [8] and could affect the therapy of millions of patients, this finding is of particular relevance. Specifically, the presented approach could help to reduce the device-specific dependency of DL algorithms and therefore facilitate their deployment on a larger set of OCT devices.
Furthermore, the results indicate that the training image patch size is an important factor for the performance of the cycleGAN-based translation model. Larger training image patch sizes usually resulted in models whose generated B-scans were more realistic, i.e., more difficult to identify as "fake" by human observers. However, such images were also more likely to contain morphological differences with respect to the source images. These results indicate that special care should be taken to reduce the likelihood of introducing morphological artifacts. Besides matching the pathological distributions across domains at the cycleGAN training stage [34], a good practice may be to reduce the complexity of the learned translation by using smaller patch sizes, as there seems to be a trade-off between the "realism" and the morphological fidelity of the translation models. In this context, future work should focus on evaluating different architectures and/or loss functions to address the problem of cross-vendor translation while preserving retinal anatomical features in the OCT image.

Funding
Christian Doppler Research Association; Austrian Federal Ministry for Digital and Economic Affairs; National Foundation for Research, Technology and Development.