Pretraining of 3D image segmentation models for retinal OCT using denoising-based self-supervised learning

Deep learning algorithms have enabled the automated segmentation of many biomarkers in retinal OCTs, supporting comprehensive clinical research and precise patient monitoring. These segmentation algorithms predominantly rely on supervised training and specialised segmentation networks, such as U-Nets. However, they require segmentation annotations, which are challenging to collect and demand specialised expertise. In this paper, we explore 3D self-supervised learning based on image restoration techniques, which allows pretraining 3D networks with the aim of improving segmentation performance. We test two methods, one based on denoising and one on multi-task image restoration. After pretraining on a large 3D OCT dataset, we evaluate the resulting weights by fine-tuning them on two challenging fluid segmentation datasets using different amounts of training data. The chosen methods are easy to set up while providing large improvements for fluid segmentation, enabling a reduction in the amount of required annotation or an increase in performance. Overall, the best results were obtained with the denoising-based SSL method, which achieved higher performance on both fluid segmentation datasets with shorter pretraining durations.


Introduction
Modern imaging modalities such as fundus imaging and optical coherence tomography (OCT) are essential in the field of ophthalmology, as they make it possible to precisely study and follow disease progression. OCT in particular enables imaging the cross-section of the retina and obtaining 3D high-resolution volumetric scans. This helps to precisely monitor a series of relevant clinical biomarkers, such as retinal fluid volume or retinal layer thickness, which are crucial for the surveillance of retinal diseases and for the adjustment of patient treatment [1][2][3]. Nevertheless, manual measurement of these biomarkers is particularly cumbersome, and automation is required.
The current state-of-the-art solutions for automated biomarker quantification are based on Deep Learning (DL), which has already been shown to achieve high performance in the context of OCT analysis [4,5]. These models have high modelling capacity and are able to process complex data such as entire 3D OCTs; they are already used to automatically monitor retinal biomarkers such as retinal fluid [6] or to facilitate the diagnosis of pathologies by detecting subclinical biomarkers [7,8]. For instance, for patients suffering from neovascular age-related macular degeneration (nAMD), DL-based algorithms allow for reliable quantification of different fluid types and enable improved tracking of fluid activity [9]. Similarly, in diabetic macular edema (DME) patients undergoing anti-VEGF treatment, automated segmentation has allowed the investigation of predictive biomarkers [10].
However, current segmentation often relies on supervised learning with 2D U-shaped DL networks [11] (U-Nets), whose training requires large annotated datasets. These annotations are difficult to collect because they require expert knowledge and are very costly and cumbersome to produce.
This relative data-inefficiency limits performance and hinders the implementation of automated segmentation systems for new biomarkers, new imaging devices or new populations. For classification problems, it is easier to find large networks pretrained on extensive public supervised datasets (for instance ImageNet). However, in our case, two limitations arise: the available datasets are far from the target domain (retinal OCTs), and no pretrained weights are available for segmentation networks. A solution to this problem is self-supervised learning (SSL) [12]. SSL provides training mechanisms that do not require annotations and allow pretraining networks or learning relevant representations. Indeed, large numbers of unannotated OCTs accumulate in daily clinical care or during clinical studies, and this data can be capitalised on to increase the final performance of segmentation models.

Related work
SSL exploits unlabelled data to pretrain deep learning models or to learn representations, and there are multiple strategies to implement it. In particular, much recent effort has focused on contrastive learning. These methods exploit multiple views of the same samples to build robust representations and excel at learning discriminative features. SimCLR [13], MoCo [14] and VICReg [15] are examples of relevant state-of-the-art contrastive-learning methods. However, as they often pretrain only an encoder network and rely on Siamese or more complex structures, which incur substantial computational overhead, they are less suitable for pretraining 3D segmentation models.
On the other hand, information-restoration SSL consists in recovering information from transformed input samples. Either the whole input or just some parameters associated with the transformations can be restored. When the whole image or volume is restored, this allows training a U-Net, whose weights can then be transferred (by fine-tuning) to segmentation tasks. Typical transformations include inpainting, rotation, solving jigsaw puzzles or recovering colourisation [16][17][18][19]. Moreover, denoising autoencoders [20] have recently regained attention thanks to diffusion generative models [21], which can be implemented as a set of denoising tasks. The fundamental objective of these methods is to implicitly approximate the underlying distribution generating a dataset. Although interesting, these generative models are usually difficult to train and require large computing capability to be applied to complex 3D images. In a simplified setting, denoising can also be used to extend pretrained "off-the-shelf" encoders into full segmentation networks [22].
The rarity and complexity of annotations in medical imaging have encouraged the application of SSL, and it has shown great potential. For instance, with a "Rubik's cube" pretext task, the authors of [23] were able to improve brain CT classification and brain MRI segmentation tasks. A transformer-based model, trained as a masked autoencoder [24], was able to improve a large set of classification tasks in [25]. In [26], the authors used an adapted contrastive method to pretrain a model, which exhibited improved performance on dermatology and chest X-ray classification tasks. Other works have focused on longitudinal modelling to improve prediction tasks, as for instance [27], in which a modified temporally-informed non-contrastive loss was applied.
In addition to applications focused on classification tasks, some works have specifically targeted medical segmentation with self-supervised learning. Most of these methods rely on various forms of image restoration tasks, which allow pretraining U-Net networks. For instance, in [28], a network is pretrained by restoring the original image (after multiple patch swaps) and evaluated on brain tumor segmentation, where this pretraining improves performance. Similarly, Model-Genesis [29] extended the image restoration idea by employing multiple corruption methods targeting local and global patterns as well as intensity distributions before performing image restoration, which allowed the improvement of prediction and segmentation tasks on multiple 3D image modalities. They demonstrated that SSL pretraining accelerates training or reduces the amount of annotated data required for downstream tasks. With the goal of performing few-shot segmentation of CT and MRI images, the work of [30] successfully devised a self-supervised technique based on superpixel pseudo-labels. Another approach, mixing contrastive learning and restoration SSL [31], allows pretraining a 3D transformer network to improve segmentation performance on CT and MRI images. However, the contrastive task, the rotation prediction and the inpainting tasks make it a very memory-heavy architecture, which can make experimentation and hyperparameter tuning difficult. Although these works do not cover the retinal OCT modality, they indicate promising applicability to 3D OCT segmentation.
Contribution In this work, we focus on SSL and pretraining models on 3D OCTs to improve downstream segmentation tasks. Specifically, we explore two image-restoration methods based on different corruption tasks. An advantage of these methods is that they provide pretrained weights for segmentation networks (U-Net or encoder-decoder type) and use a single pretraining loss, facilitating the hyperparameter search. Image restoration also allows us to work on patches, which simplifies the processing of 3D OCTs. We therefore rely on two SSL methodologies (Fig. 1): the first one, Denoise, is based on an image denoising task, and the second one, MultiTask, is based on a set of image reconstruction tasks. Additionally, we explore hybrid approaches where a pretrained encoder is already available, with "off-the-shelf" weights such as those from the supervised video dataset Kinetics [32].
We tested these methods on a large in-house 3D OCT dataset (more than 70'000 volumetric OCTs), and for each method, we explored two alternatives: pretraining the entire U-Net, i.e., both the encoder and the decoder, or only the decoder, paired with an encoder initialized with off-the-shelf weights. These four models were evaluated on two challenging datasets from two populations typical for retinal fluid segmentation, the Choroidal Neovascularization (CNV) and DME groups: an in-house dataset for CNV (Fluid-CNV dataset) and a public one for DME (UMN-DME dataset).

Methods
Our goal is to effectively pretrain deep learning models using SSL before fine-tuning their weights to excel at segmentation tasks. We rely on two SSL methodologies (Fig. 1): Denoise and MultiTask. In this section, we describe both methods, how their training loss is computed, and the different transformation functions used during training.

Denoise pretraining
The first approach to pretraining is inspired by [22] and consists of a denoising-based self-supervised method. Unlike the original paper, where the authors extend a pretrained ImageNet encoder into a U-Net by adding a decoder and pretraining it with SSL, we pretrain the whole network with the SSL method. The task consists of separating the added Gaussian noise from the original input. By performing this task, the goal is to learn features that can separate the relevant patterns or structures of the OCT from the added noise.
In this approach, the original input OCT $x$ is corrupted by adding Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$, whose scale is controlled by the scalar parameter $\sigma$, giving the noisy input $\tilde{x} = x + \sigma\epsilon$. The model $f_\theta$ then extracts the noise from the input image by minimizing the $L_2$ loss between the added and the predicted noise, $\mathcal{L}(\theta) = \lVert f_\theta(\tilde{x}) - \epsilon \rVert_2^2$, in contrast to predicting the original input as in classical denoising. We predict the noise instead of the original input following a modification inspired by denoising diffusion models, which has been shown to improve results [22].
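For concreteness, here is a minimal PyTorch sketch of one Denoise pretraining step under these assumptions; the model interface and input shapes are illustrative, not the exact implementation used in the paper:

```python
import torch
import torch.nn.functional as F

def denoise_step(model, x, sigma=0.22):
    """One Denoise pretraining step: the network predicts the added noise.

    model: a 3D image-to-image network (e.g., a 3D U-Net) mapping a volume
           to a volume of the same shape.
    x:     a batch of OCT patches, shape (B, 1, D, H, W), intensity-normalized.
    sigma: scale of the added Gaussian noise (0.22 in our setup).
    """
    eps = torch.randn_like(x)          # standard Gaussian noise
    x_noisy = x + sigma * eps          # corrupted input
    eps_pred = model(x_noisy)          # predict the noise, not the clean image
    return F.mse_loss(eps_pred, eps)   # L2 between added and predicted noise
```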

MultiTask pretraining
This approach is a generalized form of image restoration SSL, based on Model-Genesis [29]. It consists of learning to restore an image corrupted with a series of transforms carefully chosen to target different properties of the training images. The general pipeline is the following: let $x$ be an input image from the unlabelled dataset; $x$ is corrupted through a series of transformations $T_i$, giving the corrupted image $\tilde{x}$. This corrupted image is fed to an image-to-image network (U-Net-like) $f_\theta$, which tries to reconstruct the original input as $x_r = f_\theta(\tilde{x})$. The network $f_\theta$ is trained by minimizing the $L_2$ loss between the original and reconstructed image, $\mathcal{L}(\theta) = \lVert x_r - x \rVert_2^2$. The chosen corruption transforms are applied in this order: non-linear intensity change, local voxel shuffling, inpainting and outpainting (Fig. 2). All these transformations allow learning relevant representations, which are robust to a wide range of transformations while capturing both low-level information, through the ability to perform local reconstruction, and high-level information, by restoring the general structure of the retina. The exact parameter values are given in the deep learning setup section.
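The resulting training step mirrors the Denoise one; only the corruption and the regression target change. A minimal sketch, assuming a `corrupt` function that composes the four transforms described below:

```python
import torch.nn.functional as F

def multitask_step(model, x, corrupt):
    """One MultiTask pretraining step: restore the original volume.

    model:   a 3D image-to-image network (U-Net-like) f_theta.
    x:       a batch of OCT patches, shape (B, 1, D, H, W).
    corrupt: composition of the corruption transforms T_i, applied in the
             order NLIS -> local voxel shuffle -> inpainting -> outpainting.
    """
    x_tilde = corrupt(x)               # corrupted input
    x_restored = model(x_tilde)        # reconstruction x_r
    return F.mse_loss(x_restored, x)   # L2 between original and reconstruction
```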

Inpaint Multiple 3D boxes of the volume are filled with a random value. The size of each box is chosen randomly within a predefined range, and the filling value is selected from a uniform distribution within the image intensity range. Inpainting breaks large structures present in the retina, such as retinal layers, and thus encourages the network to learn features capturing the general structure of the retina in order to restore them.
Outpaint The external areas of the OCT volume are masked using a superposition of large volume masks, inverted so that only the internal area is kept. The size of the masks is selected randomly within a predefined range. Since the volumes are cropped during pretraining, the transformation removes external context, forcing the representations to capture it from local information.
Local voxel shuffle The voxels are shuffled within a small volume. The size of the volume and its location are selected randomly within a specified range. The transformation disturbs small structures and degrades fine-grained details; this encourages the features to encode texture information more robustly, which is helpful to distinguish anomalous areas such as fluid pockets.
Non-linear intensity change The distribution of intensities is remapped using cubic Bézier curves. In our experiments, we limit the shape of the transformation to avoid inverted intensities by fixing two of the four control points (the endpoints). The two remaining control points are randomly selected within the intensity range ($[I_{\min}, I_{\max}]^2$) for each volume. This transformation increases the invariance to image intensity variability.
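To make two of these corruptions concrete, here is a minimal NumPy sketch of the non-linear intensity change and the local voxel shuffle; the parameter ranges and the assumption of intensities in [0, 1] are illustrative, not the exact values used in our experiments:

```python
import numpy as np

def bezier_intensity_shift(vol, rng):
    """Non-linear intensity change via a cubic Bezier curve.

    Endpoints are fixed at (0, 0) and (1, 1) to avoid inverted intensities;
    the two inner control points are drawn uniformly. Assumes vol in [0, 1].
    """
    p1, p2 = rng.uniform(0, 1, 2), rng.uniform(0, 1, 2)
    t = np.linspace(0.0, 1.0, 1000)
    xs = 3 * (1 - t) ** 2 * t * p1[0] + 3 * (1 - t) * t ** 2 * p2[0] + t ** 3
    ys = 3 * (1 - t) ** 2 * t * p1[1] + 3 * (1 - t) * t ** 2 * p2[1] + t ** 3
    order = np.argsort(xs)                       # ensure increasing x for interp
    return np.interp(vol, xs[order], ys[order])  # remap voxel intensities

def local_voxel_shuffle(vol, rng, n_blocks=1000, max_size=8):
    """Shuffle voxels inside small random boxes to degrade fine details."""
    out = vol.copy()
    for _ in range(n_blocks):
        d, h, w = rng.integers(2, max_size + 1, size=3)
        z = rng.integers(0, out.shape[0] - d + 1)
        y = rng.integers(0, out.shape[1] - h + 1)
        x = rng.integers(0, out.shape[2] - w + 1)
        block = out[z:z+d, y:y+h, x:x+w].ravel()  # flatten the box's voxels
        rng.shuffle(block)
        out[z:z+d, y:y+h, x:x+w] = block.reshape(d, h, w)
    return out
```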

Decoder-only pretraining
For each SSL method (Denoise and MultiTask), we explored an alternative: loading pretrained weights (from the Kinetics dataset) into the encoder and freezing it, so that only the decoder is pretrained on OCTs. Two questions arise: why freeze the encoder, and why use non-domain-specific Kinetics weights? First, this solution is inspired by [22], and it can be of value since it accelerates the pretraining and benefits from supervised classification datasets if they are available. Second, an ideal solution would be an encoder from the same domain as the training data, but in our case large supervised 3D OCT datasets were not available, so we adopted encoders pretrained on the Kinetics dataset [32], which were available for our encoder architecture. Such models are denoted with the suffix -D in our experiments, while fully pretrained models carry the -ED suffix.
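In practice, this only requires disabling gradients for the encoder before pretraining; a minimal sketch, assuming the network exposes its encoder as an attribute (the attribute name is illustrative):

```python
def freeze_encoder(model):
    """Freeze the encoder for decoder-only (-D) pretraining.

    Assumes `model.encoder` holds the encoder, e.g., a 3D ResNet18
    initialized with Kinetics weights; only the decoder keeps training.
    """
    for param in model.encoder.parameters():
        param.requires_grad = False
    model.encoder.eval()  # also fix normalization-layer statistics
```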

Downstream tasks
To evaluate the pretraining strategies, we focus on the 3D OCT segmentation of different types of retinal fluid, which is both challenging and practically useful for clinical research and the management of patients with exudative retinal diseases. Indeed, retinal fluid can appear in different pathologies, namely in the following prevalent diseases: Choroidal Neovascularization (CNV), Diabetic Macular Edema (DME) and Retinal Vein Occlusion (RVO). CNV is a late stage of Age-related Macular Degeneration (AMD), where leakage from new vessels leads to fluid pockets in the retina. In DME patients, on the other hand, leakages are caused by damage to the blood-retinal barrier and are a consequence of diabetes. Fluid pockets are further classified depending on their position, and we focus on the two main types: Intraretinal Cystoid Fluid (IRC) and Subretinal Fluid (SRF). We selected two datasets to perform this segmentation of IRC and SRF fluids.

Datasets
The overview of the datasets used in the experiments is provided in Table 1.

OCT-SSL:
The self-supervised dataset, denoted OCT-SSL, consists of 71'680 OCTs from 10'156 eyes imaged with the Spectralis scanner (Heidelberg Engineering, DE). The dataset is built from the imaging repository of the OPTIMA Lab, consisting of scans from clinical studies and routine clinical care. The vast majority of patients imaged suffered from AMD, 58% with CNV and 29% with geographic atrophy (GA), as well as a small number of patients with DME and RVO.
All patients gave informed consent prior to inclusion in the respective studies. This retrospective analysis was approved by the Ethics Committee at MedUni Wien (EK Nr: 1246/2016). All study procedures were conducted in accordance with the Declaration of Helsinki, and all patient data were pseudonymized. The dataset is split into three subsets for training, validation and test, with ratios of 90/5/5%. We ensured that the pretraining dataset does not overlap with the datasets used for the downstream target segmentation tasks, which are described next.
Fluid-CNV: This in-house dataset comprises 84 OCT scans from 84 eyes of patients diagnosed with CNV. Each OCT has 49 BScans of 512 × 1024 px. The IRC and SRF subtypes were manually annotated separately. The dataset is split into 5 folds at the patient level to perform cross-validation. All OCTs were acquired with a Spectralis scanner (Heidelberg Engineering, DE).
Fluid-CNV reduced dataset: To evaluate the performance of the models with limited annotations, we created a reduced version of the dataset, where the training and validation sets are randomly reduced to 20, 40, 60 and 80% of the original size. The original five test sets from cross-validation are kept unchanged to allow comparison between settings.

UMN-DME dataset:
We tested our pretrained models on a public dataset: the DME OCT dataset presented in [33] and published by the University of Minnesota. The dataset consists of 29 OCT scans from 29 patients with DME, where SRF was annotated on each scan by two expert clinicians. Each scan has 25 BScans of 496 × 1024 px. The scans were acquired with a Spectralis scanner (Heidelberg Engineering, DE). We also performed a 5-fold cross-validation at the patient level.
UMN-DME reduced dataset: Additionally, we created a version of the dataset with reduced training data, where we kept only 50% of the training and validation folds while the test folds remained unchanged. Given the small number of patients, this reduced version of the dataset is especially difficult to segment in 3D.

Segmentation evaluation metrics
For our downstream segmentation experiments, we evaluated the results using two metrics. The first metric is the Dice score, which measures twice the intersection of the ground truth and the prediction divided by the sum of the volumes of both masks. The second metric is the Absolute Volume Difference (AVD), which is relevant in our case, where the fluid volume is a clinically important biomarker.
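A minimal NumPy sketch of both metrics on binary masks; the voxel-volume argument is an assumption for converting voxel counts to physical volume:

```python
import numpy as np

def dice_score(pred, gt, eps=1e-8):
    """Dice = 2 * |P ∩ G| / (|P| + |G|) on binary 3D masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    return (2.0 * np.logical_and(pred, gt).sum() + eps) / (pred.sum() + gt.sum() + eps)

def absolute_volume_difference(pred, gt, voxel_volume):
    """AVD: absolute difference between predicted and reference fluid volume,
    where `voxel_volume` is the physical volume of one voxel (e.g., in mm^3)."""
    return abs(int(pred.sum()) - int(gt.sum())) * voxel_volume
```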

Deep learning setup and preprocessing
Preprocessing and OCT flattening: All OCTs, for pretraining and downstream tasks, were flattened along Bruch's membrane using automated segmentation methods [34][35][36] and [37]. The vertical dimension is cropped to a height of 224 and the horizontal dimension resized to 512, and all BScans are kept for pretraining. For segmentation, the vertical crop is extended to a window of 256 × 512 pixels in order to capture the whole retina in extreme cases, e.g., retinal swelling due to the presence of large fluid pockets. The cropping window is selected with a fixed offset with respect to Bruch's membrane. Lastly, the voxel intensities were normalized to zero mean and unit standard deviation using the statistics of the training set of the OCT-SSL dataset.
Self-supervised learning: The network is optimized using the Mean Squared Error (MSE) loss for both the MultiTask and Denoise setups. We used a custom network combining a 3D U-Net [38] with a 3D ResNet18 encoder [39]. The epoch with the lowest validation MSE is saved. For all pretraining methods, the OCT volumes are randomly cropped to 128 × 128 pixels per BScan and 48 BScans, to allow a reasonable batch size during training. We used the AdamW optimizer with a batch size of 24 and trained with a fixed learning rate of 0.0001. The pretraining experiments were run on a single Nvidia A100 GPU.
Downstream segmentation: The pretrained U-Net is fine-tuned on the downstream segmentation task. The final convolutional layer is replaced with a new convolutional layer with one or two output feature maps, followed by a sigmoid activation layer. For both datasets, the network is trained with a mix (equal contribution) of binary cross-entropy and Dice loss; for the Dice loss, the background class was ignored. When there are two fluid classes, they are treated as two binary problems instead of a multi-class setting, which gave lower performance. For the first 8 epochs, only the last convolutional layer is trained, in order to initialize it; then the whole network is fine-tuned for 512 epochs. We tested learning rates ranging from 0.001 to 0.00001, and the best setting was selected using the validation Dice score.
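A minimal sketch of this combined objective, assuming raw logits of shape (B, C, D, H, W) with each fluid class treated as an independent binary problem:

```python
import torch
import torch.nn.functional as F

def seg_loss(logits, target, eps=1e-6):
    """Equal mix of binary cross-entropy and foreground-only Dice loss.

    logits: raw network output, shape (B, C, D, H, W).
    target: float tensor of 0/1 masks with the same shape.
    """
    bce = F.binary_cross_entropy_with_logits(logits, target)
    probs = torch.sigmoid(logits)
    dims = (0, 2, 3, 4)                   # reduce over batch and spatial dims
    inter = (probs * target).sum(dims)
    denom = probs.sum(dims) + target.sum(dims)
    dice = 1.0 - ((2.0 * inter + eps) / (denom + eps)).mean()  # per-class Dice
    return 0.5 * (bce + dice)
```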
Denoise transform parameters: Following [22] and our initial experiments, Denoise models are trained with a noise scale σ of 0.22.

Pretraining performance
We present here the results of the pretraining with the two methods: MultiTask and Denoise. For both methods, under the Encoder-Decoder (ED) and Decoder-only (D) settings, training losses decreased rapidly and saturated after a few dozen epochs. Overall, Denoise pretraining was approximately 50% shorter than MultiTask, and hybrid pretraining reduced the pretraining time by 13% for Denoise and by 8% for MultiTask. However, hybrid models (-D) still reached a higher MSE compared to their fully pretrained counterparts (-ED). This was expected, since hybrid methods have fewer trainable parameters. The training curves are displayed in Fig. 3. Examples of reconstructions performed by the different approaches are shown in Fig. 4. In this example, for the MultiTask-ED method (Fig. 4, second column), we can notice an improvement in the reconstruction of the external limiting membrane (ELM) (see yellow arrow), from an area that was completely masked in the input, compared to its MultiTask-D counterpart, whose encoder was pretrained off-the-shelf and frozen. Similarly, for Denoise pretraining, the ED version demonstrates better noise separation around the thin structure of the ELM (blue arrows), which appears sharper after denoising.

Downstream performance on fluid segmentation in CNV
Full dataset: We evaluate the four pretrained models on fluid segmentation for the IRC and SRF fluid compartments against a model trained from scratch. Results are displayed in Table 2 and in Fig. 5. We observe that pretraining always improves results for both metrics, with the best results obtained by Denoise-ED pretraining. The overall best Dice was 0.754 ± 0.024 for IRC and 0.819 ± 0.049 for SRF. For IRC, all pretrained models obtained a significantly higher Dice score compared to training from scratch. In terms of AVD, a significant improvement is observed for IRC for all models except MultiTask-ED.
For SRF, all pretrained models except Denoise-D reached a significantly higher (p < 0.05) Dice score, and AVD was also lowered. In general, MultiTask-ED gave the smallest gain compared to training from scratch.
Reduced dataset: We evaluated the same models with reduced amounts of fine-tuning data, ranging from 20% to 80% of the original training data. The performance for each metric with respect to the amount of training data is shown in Fig. 6. First, as expected, we observe that performance increases when more training data is available; however, for all data settings, the improvement from pretraining was substantial. Overall, the best results are obtained with the Denoise-ED method, sometimes paralleled by the hybrid version (Denoise-D). Under the most data-constrained setting (20%), pretraining allowed an improvement in Dice score of 22% for IRC and 6% for SRF. Moreover, as shown in the figures (Fig. 6, dashed line), the best performance of a model trained from scratch can already be achieved with 40 to 60% of the training data, depending on the evaluation metric. For this dataset, pretraining thus effectively reduces the amount of annotation required or increases the maximal performance. Detailed numerical results are provided in the Annex (Table 4), together with qualitative segmentation results (Fig. 7 and Fig. 9).

Downstream performance on fluid segmentation in DME
In this dataset, only SRF is considered, and we included the experiment with limited training data (50%). With the entire dataset, we observe a clear benefit from pretraining the models, with a significant gain in Dice score (p < 0.01) compared to the model trained from scratch. The pretrained models achieved similar results, with the best Dice score and AVD obtained by Denoise-ED pretraining. All pretrained models improved the SRF Dice score significantly over training from scratch, but the drop in AVD was more limited. Results are listed in Table 3 and displayed in Fig. 8.
Reduced dataset: For the reduced dataset, the gain in Dice score obtained with pretraining becomes even larger. As with the whole training dataset, the best-performing model was Denoise-ED. This model obtains a higher Dice score (0.792 ± 0.078) than the model trained from scratch on the entire dataset (0.744 ± 0.096). Although AVD scores are better with pretraining, the differences are not significant, probably because of a higher sensitivity of this metric to outliers. Results are displayed in Table 3 and Fig. 8.

Conclusion
Segmentation of 3D retinal OCTs has become an important tool in ophthalmology; however, current methods suffer from data inefficiency, as segmentation annotations are difficult and costly to produce. To overcome this problem, we explored denoising- and restoration-based SSL pretraining: we pretrained with these methods on a large unlabelled dataset, then transferred the weights by fine-tuning on two fluid segmentation tasks. In our experiments, we observed that denoising-based SSL was the better strategy. Indeed, while being faster to pretrain and requiring less hyperparameter tuning (a single parameter), it achieved the best results in the majority of evaluations, outperforming the other methods.
For both datasets, we repeated the segmentation experiments in a limited-data setting. In these experiments, pretraining showed an even greater improvement in performance over training from scratch. For the segmentation of the Fluid-CNV and UMN-DME datasets, around 50% of the training samples were sufficient to reach the same performance as with the full training set. Overall, Denoise-based pretraining enabled increasing the maximal segmentation performance or, alternatively, reducing the amount of manual annotation required for a given level of performance.
Contrary to Denoise pretraining, we observed in our experiments that the MultiTask method had several limitations. The pretraining duration was significantly longer, and it involved more hyperparameters (transformations) compared to Denoise (a single parameter). Finally, although MultiTask pretraining enhanced performance over training from scratch, gains were limited relative to the Denoise models. Denoising appeared to be a more relevant corruption task than the mix of tasks in MultiTask. The latter could suffer from an inadequate balance between the different tasks, which is difficult to correct given the number of hyperparameters. Moreover, denoising is at the heart of other successful methods, such as denoising diffusion models [21] or regularisation problems [40], which demonstrates the versatility of this simple task and its ability to yield powerful representations.
We also tested a form of hybrid pretraining, where we used pretrained encoders and extended them to fully pretrained segmentation networks. These models only slightly under-performed their fully-pretrained counterparts on both tasks, despite a reduction in pretraining time. This could constitute a convenient trade-off if training resources are limited. Moreover, our hybrid models were based on Kinetics encoders, which are expected to be suboptimal for retinal analysis; for future experiments, it would therefore be relevant to include encoders trained directly on OCTs.
During our study, we limited the size of our main network (a 3D U-Net with a 3D ResNet18 encoder) to keep the computationally heavy pretraining tractable, which could have limited the final performance. To fully exploit 3D pretraining on our large dataset, it would be of interest, as future work, to repeat these extensive experiments with larger 3D segmentation networks, or even with Transformer-based networks. Although 3D U-Nets remain relevant and state of the art for medical image segmentation [41], this would also allow us to confirm that the observed gains carry over to other types of network architectures.
In conclusion, image-restoration SSL, especially when based on denoising, makes it possible to effectively pretrain 3D segmentation networks for OCT segmentation. While being easy to set up, it leads to improved downstream performance or enables reducing the amount of required annotation work.

Fig. 1 .
Fig. 1. General pretraining architectures and setup for the Denoise and MultiTask self-supervised learning settings, with transfer to the downstream segmentation task.

Fig. 2 .
Fig. 2. Example of the transformations applied to OCT volumes in MultiTask pretraining. The original OCT volume (left) is transformed randomly through four operations: inpainting, outpainting, local voxel shuffling and non-linear intensity shifts (NLIS).

Fig. 3 .
Fig. 3. Validation MSE during the pretraining of the Denoise (left) and MultiTask (right) methods. Each method has an ED (encoder-decoder) version, where the whole network is pretrained, and a hybrid version D, where the encoder is frozen. The total training time is also displayed for each setting (single Nvidia A100 GPU).

Fig. 4 .
Fig. 4. Examples of MultiTask and Denoise pretraining cases; the upper row shows the input volumes and the lower row the reconstructed images. (*) For denoising, since the network predicts the noise, we display the input minus the predicted noise, i.e., the denoised image, for the sake of visualization.

Table 2 .
Segmentation results on the Fluid-CNV dataset, with Dice Score and Absolute Volume Difference (AVD) for the IRC and SRF classes. We report mean value ± standard deviation across folds. Pretrained models are compared against the ones without pretraining, and statistical significance is represented with asterisks (* p-value < 0.05, ** p-value < 0.01, *** p-value < 0.001).

Fig. 5 .
Fig. 5. Segmentation performance on the Fluid-CNV dataset, with Absolute Volume Difference (AVD) (first row) and Dice Score (second row) of the four models for the two fluid classes, SRF and IRC.

Fig. 6 .
Fig. 6. Segmentation performance on the Fluid-CNV dataset, with Dice Score and Absolute Volume Difference (AVD) of the four models for the two fluid classes, IRC and SRF. The models are fine-tuned with amounts of training data ranging from 20% to 100%. The dotted line represents the performance obtained with no pretraining and 100% of the data.

Fig. 7.
Fig. 7. Segmentation examples on the Fluid-CNV dataset. Each row corresponds to a certain amount of training data, from 20-80%, with the segmentations from the five different models. True positives are displayed in green, false positives in blue, and false negatives in red.

Table 3.
Segmentation results on the UMN-DME dataset, with Dice Score and Absolute Volume Difference (AVD) for SRF, for the entire dataset and its reduced version (50%). We report mean value ± standard deviation across folds. Pretrained models are compared against the ones without pretraining, and statistical significance is represented with asterisks (* p-value < 0.05, ** p-value < 0.01, *** p-value < 0.001).

Fig. 8 .
Fig. 8. Segmentation performance on the UMN-DME dataset, with Dice Score and Absolute Volume Difference (AVD) for the five models on SRF segmentation. The models are fine-tuned with 50% and 100% of the data.

Fig. 9 .
Fig. 9. Examples of segmentations on the Fluid-CNV and UMN-DME datasets. The ground truth (yellow) is displayed first, followed by the segmentations from the different models. True positives are displayed in green, false positives in blue, and false negatives in red.