DISCOVER: 2-D Multiview Summarization of Optical Coherence Tomography Angiography for Automatic Diabetic Retinopathy Diagnosis

Diabetic Retinopathy (DR), an ocular complication of diabetes, is a leading cause of blindness worldwide. Traditionally, DR is monitored using Color Fundus Photography (CFP), a widespread 2-D imaging modality. However, DR classifications based on CFP have poor predictive power, resulting in suboptimal DR management. Optical Coherence Tomography Angiography (OCTA) is a recent 3-D imaging modality offering enhanced structural and functional information (blood flow) with a wider field of view. This paper investigates automatic DR severity assessment using 3-D OCTA. A straightforward solution to this task is a 3-D neural network classifier. However, 3-D architectures have numerous parameters and typically require many training samples. A lighter solution consists in using 2-D neural network classifiers processing 2-D en-face (or frontal) projections and/or 2-D cross-sectional slices. Such an approach mimics the way ophthalmologists analyze OCTA acquisitions: 1) en-face flow maps are often used to detect avascular zones and neovascularization, and 2) cross-sectional slices are commonly analyzed to detect macular edemas, for instance. However, arbitrary data reduction or selection might result in information loss. Two complementary strategies are thus proposed to optimally summarize OCTA volumes with 2-D images: 1) a parametric en-face projection optimized through deep learning and 2) a cross-sectional slice selection process controlled through gradient-based attribution. The full summarization and DR classification pipeline is trained from end to end. The automatic 2-D summary can be displayed in a viewer or printed in a report to support the decision. We show that the proposed 2-D summarization and classification pipeline outperforms direct 3-D classification with the advantage of improved interpretability.


Introduction
Diabetic Retinopathy (DR), a complication of diabetes, is a major and growing cause of vision impairment and blindness.By 2040, around 600 million people throughout the world will have diabetes (Ogurtsova et al., 2017), a third of whom will have DR (Yau et al., 2012).One major problem in the management of DR is its reliance on an older imaging technique, namely Color Fundus Photography (CFP).Various classifications based on CFP were proposed over the years (David et al., 1969;ETDRS Research Group, 1991;Wilkinson et al., 2003).Unfortunately, decisions based on these classifications have poor predictive power.For instance, a severe non-proliferative DR case evolves to proliferative complication in 51.5% of cases, with only 17.1% evolving to a high risk of blindness (ETDRS Research Group, 1991).This makes the management of DR challenging: clinicians often err on the side of caution and treat all those patients to mitigate the risk of complications.In the past decades, a significant number of studies have relied on CFP images for the automatic assessment of DR.The application of machine learning techniques, particularly deep learning, to these images has shown promising results in the detection and categorization of DR (Ting et al., 2019;Quellec et al., 2021).While these advancements are noteworthy, such techniques will always suffer from the poor predictive power of CFP.Fortunately, new imaging modalities are emerging that may improve predictions.
Optical Coherence Tomography (OCT) is a non-invasive imaging technique that uses interferometric information of partially coherent light to create cross-sectional (2-D B-scans) and three-dimensional (C-scans) structural images of biological tissues.Within optical scattering media, it can penetrate a few millimeters in depth, with micrometer resolution, and is therefore particularly well-suited to image the retina (Hitzenberger et al., 2003).OCT Angiography (OCTA) is a motion-sensitive extension of OCT enabled by fast OCT acquisitions: it was shown to contrast the retinal vasculature and can provide quantitative blood flow information (Baumann et al., 2011).An OCTA acquisition can be summarized by two volumes: a structure volume, obtained by averaging consecutive 3-D scans, and a flow volume, describing the amplitude of local intensity variations across those consecutive 3-D scans (Gorczynska et al., 2016).To analyze the blood flow in specific vascular plexuses, clinicians generally inspect en-face (or frontal) Maximal Intensity Projections (MIP) of the flow volume in the corresponding retinal or choroidal layers (Gorczynska et al., 2016).In such 2-D projections, each 1-D A-scan is replaced with the maximal intensity value (throughout the entire A-scan or within the considered layers only).Recent OCTA devices enable ultra-widefield acquisitions of the retina (90 • ) (Niederleithner et al., 2023).In summary, OCTA can capture ultra-widefield volumetric structural and functional (flow) images of the retina: it is, therefore, a promising technique to diagnose various ocular pathologies (Alam et al., 2019;Yang et al., 2023;Xu et al., 2023;Zang et al., 2023).In particular, DR can clearly benefit from OCTA (Yang et al., 2022): 1) the structure volume allows objective and quantitative assessment of diabetic macular edema (DME), 2) flow MIPs allow quantification of retinal vascular plexuses, non-perfusion and vessel density as well as the identification of damage; Vujosevic et al. (2021) lists the various biomarkers of DR and DME in OCTA acquisitions.
Theoretically, the radiomics approach and the 2-D classification approach are suboptimal in the sense that relevant features useful for classification may have been lost during the preprocessing (MIP) and feature extraction steps.However, the 3-D classification approach also has some limitations.First, compared to their 3-D counterparts, 2-D neural architectures have better pre-trained weights (e.g., ImageNet weights) and have fewer parameters to optimize, thus requiring fewer training samples.Given the limited OCTA data sets available, this aspect is critical.Second, end-to-end 3-D classification lacks the interpretability power of the radiomics approach and, to a lesser extent, of the end-to-end 2-D classification approach.To alleviate these limitations, we propose an end-to-end 3-D image classification approach relying on 2-D views as intermediate steps, the 2-D view extraction process being trainable.This guarantees that: 1) 2-D neural architectures can be used at the end of the classification pipeline, while 2) relevant problem-specific features can be extracted at the beginning of the pipeline.In detail, two types of view are extracted: 1. 2-D en-face (or frontal) projections, generalizing the flow (or structure) MIP images used to assess OCTA flow features (Yang et al., 2022), 2. selected 2-D slices (B-scans), often used to assess structural OCT features (Yang et al., 2022): attribution methods (Sundararajan et al., 2017;Samek et al., 2017) are used to identify the most relevant scan lines in the en-face projection, which are deemed worthy of further investigation in 2-D.
To maximize the interpretability of en-face projections, a novel "model dropout" mechanism is introduced.Projections are processed by an ensemble of 2-D image classifiers, and during training, classifiers in the ensemble are dropped randomly.The extracted 2-D features thus become more classifierindependent, i.e., more general and hopefully more meaningful to the human eye.This paper aims to automate the most recent DR severity classification (Wilkinson et al., 2003) from OCTA acquisitions, using the proposed interpretable classification approach, one step toward better management of DR.

3-D→2-D Projection
In recent years, 3-D→2-D projection was used to solve various medical image analysis tasks.One such task is 2-D/3-D registration: in that case, the 2-D image is registered to a 2-D projection of the 3-D image.Fei et al. (2006) used that strategy to register 2-D X-ray images to 3-D Computed Tomography (CT) images: the goal was to compare the ability of both modalities to detect cardiac calcifications.More recently, Schaffert et al. (2018) and Jaganathan et al. (2023) registered intra-operative 2-D X-rays to pre-operative 3-D CT images for navigation purposes during minimally invasive surgeries.Alternatively, Van Houtte et al. (2022) registered intra-operative 2-D X-rays to a pre-operative 3-D atlas.For 2-D/3-D registration, 3-D→2-D projection is typically achieved by simulating a 2-D perspective projection from the 3-D image to the plane: in Schaffert et al. (2018) and Jaganathan et al. (2023), a linear system of equations (point-to-plane correspondences) is solved; Van Houtte et al. (2022) relies on projective spatial transformers.In Jaganathan et al. (2023), an adversarial domain adaptation step is added.
3-D→2-D projection was also used for en-face segmentation of 3-D images.This is useful when ground truth masks are obtained using en-face projections of the 3-D images or using a different 2-D imaging modality.Li et al. (2020) segmented blood vessels and foveal avascular zones in 3-D structural and flow OCTA images: the ground truth masks were obtained using the flow MIP between the Internal Limiting Membrane (ILM) layer and the Outer Plexiform Layer (OPL).Lachinov et al. (2021) segmented geographic atrophies, a sign of agerelated macular degeneration, in 3-D structural OCT images: the ground truth masks were obtained using 2D Fundus Auto Fluorescence (FAF) images.They also segmented blood vessels in 3-D OCT: the ground truth masks were obtained using en-face OCT images.Le et al. (2021) segmented blood vessels in 3-D Adaptive Optics OCT (AO-OCT) images: ground truth masks were obtained using en-face AO-OCT images.For en-face segmentation of 3-D images, variations on the U-Net architecture are typically used: in the proposed architectures, the encoder part contains 3-D operations, and the decoder part contains 2-D operations (Li et al., 2020;Lachinov et al., 2021).Alternatively, an LSTM is combined with the encoder part of a 2-D U-Net in Le et al. (2021).
A variation on the previous task is the generation of highquality 2-D images from 3-D images for visualization purposes or, optionally, for downstream segmentation tasks.Forsgren et al. (2022)  3-D→2-D projection was also used for fast feature detection or segmentation in 3-D images.In this case, ground truth masks are obtained from 3-D images: 2-D projections are simply used as intermediate steps.Shen et al. (2021) detected 3-D junction points from various tree-like structures (blood vessels, neurons) in 2-D projections, followed by 2-D→3-D reverse mapping.Similarly, Wang et al. (2021) segmented 3-D microvessels in 2-D projections of 3-D brain Magnetic Resonance Imaging (MRI), followed by 2-D→3-D reverse mapping.For this task, non-parametric 3-D→2-D projections are used, namely MIP (Shen et al., 2021;Wang et al., 2021).Finally, (Guo et al., 2021) segmented blood vessels in OCTA images.A variation on MIP was used in that study: all voxel intensi-ties in the same retinal layer are multiplied by a layer-specific trainable weight prior to MIP.
Unlike previous 3-D image classification tasks, we propose a trainable parametric 3-D→2-D projection to allow the extraction of discriminant and interpretable problem-specific features.The proposed approach is more general than (Guo et al., 2021)'s solution for segmentation, which solely emphasizes specific retinal layers.Other trainable techniques proposed for 1) 2-D/3-D registration, 2) en-face segmentation of 3-D images, or 3) high-quality 2-D image generation assume that dense ground truth (e.g., 2-D images) is available to supervise 3-D→2-D projection training.This is not our case: the only supervision signal at our disposal is the DR diagnosis.This calls for a different architecture to ensure that feature localization is not lost in the en-face plane, as demonstrated in this paper.

Attribution Methods
Over the past decade, with the growing popularity of deep learning, various methods were introduced to visualize what "black box" CNNs have learned.The purpose was notably to attribute an importance score to each image pixel for a given output target (e.g., a class prediction or a neuron activation).They can be classified into gradient-based or perturbation-based attribution methods.
The simplest gradient-based attribution method is the Saliency method by Simonyan et al. (2014), which computes the gradient of the output target with respect to each image pixel.This approach is advantageous because it allows pixellevel attribution at the cost of a single gradient backpropagation.However, not all operations in a CNN are invertible, which often leads to unsatisfactory attributions.Zeiler and Fergus (2014) thus introduced a mechanism to correctly backpropagate attributions through MaxPool operations (Deconvolution method), and Springenberg et al. (2015) introduced another way to correctly backpropagate them through ReLU activations (Guided Backprop method).These ideas were extended in Layer-Wise Relevance Propagation (LRP) and Deep Learning Important FeaTures (DeepLIFT), with additional properties: LRP ensures that the magnitude of any output is conserved through the backpropagation process (Bach et al., 2015); DeepLIFT bases the attributions on the difference between the activation of each neuron (for the current image) to its "reference activation" (Shrikumar et al., 2017).The simplest perturbation-based attribution method is the occlusion method by Zeiler and Fergus (2014), which studies how the output target is impacted when parts of the image (square patches) are occluded.Ribeiro et al. (2016) introduced Local Interpretable Model-agnostic Explanations (LIME), which looks for superpixels with the strongest association with the output target by successively turning superpixels off and on.Finally, Integrated Gradients can be regarded as both a gradientbased and a perturbation-based attribution method: 1) a set of images is generated by multiplying all pixel intensities in the target image by a constant factor ranging from 0 (the first image or reference) to 1 (the last image or target); 2) for each image in the series, a gradient-based attribution score is computed; 3) the final attribution is the integral of gradients over the series (Sundararajan et al., 2017).Intuitively, the most relevant features are detected early in the series and have, therefore, a higher integral.Unlike the other gradient-based attribution methods, multiple gradient backpropagation operations are needed.More generally, all perturbation-based attribution methods require multiple CNN evaluations, implying higher computation times and memory requirements.
To the best of our knowledge, this study is the first attempt to use attribution methods to extract and analyze relevant 2-D slices in a 3-D volume.

Overview and Notations
The geometry of an OCT/OCTA acquisition is illustrated in Fig. 1.Let y denote the depth axis along which the partial coherent light penetrates the tissues (A-scan).Let x denote the fast scanning axis: a B-scan is thus indexed by x and y.Let z denote the slow scanning axis: a volume (C-scan) is thus indexed by x, y, and z and en-face projections by x and z.Let X × Y × Z denote the size of the C-scans in voxels.Finally, a multi-channel volume (e.g., with a flow and a structure channel) is indexed by c, x, y, and z.
Given I, a preprocessed multi-channel OCTA acquisition (see section 3.2), and N acquisition-level labels, the goal is to predict whether or not experts would assign the n-th label to acquisition I, n = 1..N.In the experiments (see section 4), the goal is to assess DR severity in the patient's eye, according to the 5-level ICDR scale (Wilkinson et al., 2003): no DR (level 0), mild non-proliferative DR (NPDR, level 1), moderate NPDR (level 2), severe NPDR (level 3), proliferative DR (PDR, level 4); DR severity assessment is formulated as an N-label classification problem (N = 4): is DR severity d(I) greater than or equal to level n? Let .N} denote the probabilistic predictions and let λ n (I) ∈ {0, 1}, n = 1..N, denote the ground truth labels.
As illustrated in Fig. 2, the preprocessed 3-D acquisition I is converted to a 2-D summary image Π(I), defined as a parametric en-face projection of I (see details in section 3.3).Π(I) is defined as a color (3-channel) image for two reasons: 1. Interpretability: it can be displayed in a viewer or inserted in a report for human inspection.2. Compatibility with off-the-shelf 2-D neural architectures with ImageNet pre-trained weights.
Next, Π(I) is classified by a first ensemble C 1 of 2-D offthe-shelf image classifiers, as described in section 3.4.A first estimation p (1) (I) of the probabilistic prediction p(I) is given by Then, based on attributions derived from p (1) (I), the most relevant B-scans S (I) are selected, as described in section 3.5.For interpretability purposes, the number of selected B-scans is limited to N: one per classification output.The motivations are: 1.A small number of B-scans is compatible with human inspection in a viewer or a report.2. Each selected B-scan is associated with a severity cutoff and can, therefore, be used to document the course of action associated with that cutoff (treatment, follow-up, etc.).
Finally, the selected B-scans S (I) are classified by a second ensemble C 2 of 2-D off-the-shelf image classifiers: a second estimation p (2) (I) of the probabilistic prediction is given by C 2 • S (I).It is combined with p (1) (I) to obtain the final probabilistic prediction p(I), as described in section 3.6.

Preprocessing
An OCTA acquisition is stored as two volumes: 1. a structure volume S , where the retinal layers and various retinal anomalies (e.g., fluid) are visible, among other structures (e.g., the choroid, below the retina, and the vitreous core, above it), 2. a flow volume F, where the blood vessels of the retina and the choroid are particularly highlighted.
Additionally, OCTA acquisitions are usually associated with: 1.A 2-D en-face localizer l, aligned with the OCTA data (size: X × Z pixels), to track eye motion.The PLEX Elite 9000 (Carl Zeiss Meditec Inc. Dublin, California, USA) device, for instance, is associated with a Line Scanning Ophthalmoscope (LSO) subsystem for that purpose (Niederleithner et al., 2023).2. Automatically-segmented surfaces delineating the vitreoretinal interface, namely the Inner Limiting Membrane (ILM), and the chorioretinal interface, below the Retinal Pigment Epithelium (RPE).Let s sup and s in f denote those two surfaces, respectively.They are stored as matrices of X × Z pixels: s(x, z) represents the depth of surface s in the (x, z) A-scan of volumes S or F.
Although the LSO image is used primarily for motion tracking, it offers a complementary view on the retina (different optical properties, different wavelength, etc.), analogous to a grayscale fundus image (Niederleithner et al., 2023).We propose to analyze it jointly with the OCTA data.Therefore, we propose to preprocess an OCTA acquisition as follows (see Fig. 3).First, an "LSO volume" L is created by duplicating the LSO localizer along the y-axis: L(x, y, z) = l(x, z), ∀x, y, z.
Second, a mask volume The flow, structure, and LSO volumes are multiplied by M, element-wise, to mask the choroid and vitreous core out.Let I ′ denote the 3-channel volume: (1) Third, the retinal region is flattened by shifting all voxels of I ′ along the y-axis, so that the ILM surface is set to a small constant depth y = Y 0 .Let I ′′ denote the resulting volume: (2) Parameter Y 0 is set to a non-zero value (0 < Y 0 < Y) to limit the loss of useful information during random data augmentation (see section 3.6.1).Its value is determined as described in section 4.3.This flattening process ensures that all the relevant information is concentrated at the top of I ′′ .The advantage of flattening the ILM surface, instead of the RPE (s in f ), is that its segmentation is less error-prone, due to a better contrast.This reduces the risk of alignment errors, which would lead to discontinuities between neighboring A-scans.Fourth, I ′′ is cropped: all voxels with a depth Parameter Y 1 is chosen to ensure the retinal region is never occluded.The final preprocessed image I denotes the cropped version of I ′′ .

3-D→2-D Projection
The preprocessed 3-D acquisition I is then converted to a 2-D summary image Π(I) through a parametric 3-D→2-D enface projection Π. Lachinov et al. (2021) proposed a U-Net-like architecture for Π, where the encoder part contains 3-D operations and the decoder part contains 2-D operations.U-Net-like architectures have smaller and smaller activation maps as we go deeper into the encoder part, and their size increases as we go deeper into the decoder part to finally reach the size of the input image.This contraction aims to increase the receptive field of deep encoder filters to better consider the context without increasing their size and, therefore, the number of network parameters.The drawback of this contraction is that small details are lost in the process.To recover those details, skip-connections are therefore introduced between encoder and decoder layers (Ronneberger et al., 2015).However, this trick assumes that the ground truth signal contains small details.In particular, it requires a dense supervision signal.For a classification task, the class labels are the only supervision signals available for training Π.
Our solution to the problem is to ensure that the details in the en-face plane are never lost throughout the 3-D→2-D projection process.In particular, we guarantee that the activation maps all have the same size in the en-face plane (X × Z pixels).Only the depth of these activation maps decreases as we go deeper in Π, to reach a final depth of 1 voxel (i.e., a 2-D image).To further prevent the loss of details in the en-face plane, we also limit the receptive field of the filters to one pixel in that plane.A simple solution based on 1-D operations, where each A-scan is processed independently, is presented hereafter.It should be noted that Li et al. (2020)'s projection network for segmentation also ensures that activation maps all have the same size in the en-face plane.However, Li et al. (2020) use 3-D convolution operators (kernel size: 3 × 3 × 3 voxels): with a receptive field larger than one voxel in that plane, there is no guarantee that details are preserved without a dense supervision signal.
Note that the pooling operator precedes the convolutional layers in order to limit network complexity: since no contraction in the en-face plane is performed, this is critical.An average pooling operator is used in the first block; otherwise, half of the voxels would never be used.However, a max pooling operator is used in the following blocks to add more nonlinearity.Following common practice (He et al., 2016), the number of convolutional filters increases as we go deeper into the network.Let Φ denote the number of filters per layer in the first block.The number of filters per layer in the i-th block is set to 2 i−1 Φ.
Each block reduces the depth by a factor of 4 (2 due to pooling × 2 due to the stride in the first convolution layer).After three blocks, a global mean operator along the depth axis is performed to eliminate the depth dimension.Finally, a dense layer with sigmoid activation is applied to obtain a 2-D image with the desired number of channels, namely three channels (see section 3.1).The sigmoid activation facilitates conversion to a bitmap image for visualization.

Classification of the 3-D→2-D Projection
Now that a 2-D color image Π(I) is obtained, any image classifier C 1 can be used to predict DR severity: a CNN, a transformer, an ensemble of CNNs and/or transformers, etc.However, two novel contributions specifically suited to this proposed framework are presented hereafter and illustrated in Fig. 5.

Ensuring 3-D→2-D Projection Generality
For interpretation purposes, we would like Π(I) to be as independent from the classifier as possible.The rationale is as follows: if the projection is useful for any classifier, then we expect it to be informative for human experts as well.Various solutions can be considered: • Following federated learning, multiple Π ( j) , C ( j) 1 couples can be trained in parallel, and the weights of the Π ( j) instances can be aggregated at regular intervals.
• Following continual learning, multiple classifiers C ( j) 1 can be trained sequentially, initializing training with the weights obtained for Π with the previous C ( j−1) 1 classifier.
However, such approaches imply longer training or require more resources.Instead, we propose to train one ensemble of classifiers, but with one trick that we call model dropout: for each mini-batch, a random subset of the classifiers is used for prediction.Like the other solutions mentioned above, this ensures that the 3-D→2-D projection Π does not specialize for one specific classifier or one static ensemble of classifiers.Let γ k , k = 1..K, denote the classifiers of the ensemble.We assume that these classifiers have no final activation function (i.e., they return logits).The ensemble prediction is given by: where σ denotes the sigmoid function.The number of possible classifier combinations is given by 2 K − 1.This process is equivalent to training 2 K − 1 classifiers in random order, which is expected to improve the generality of Π. Model dropout is only used during training: the full ensemble is used during inference.The expected benefit of logit averaging (see Eq. ( 3)) is that the weight of poor classifiers in the ensemble can be reduced automatically by decreasing the amplitude of their logits.However, logit averaging may have drawbacks (Tassi et al., 2022), so the traditional strategy, namely probability averaging, was also investigated.

Data Augmentation
Data augmentation is typically performed by randomly transforming preprocessed images before feeding them to the neural network.For 2-D image classifiers, random transformations traditionally imply random affine transformations (random rotation, translation, and scaling) and random horizontal/vertical flips.However, our input preprocessed images are heavy 3-D volumes.Applying such random transformations to the 3-D volume takes a lot of time.Instead, we propose to apply them after the 3-D→2-D projection: applied to 2-D data, they are much faster.Besides, applying random spatial transformations before the projection is of limited interest since the proposed projection operator Π does not consider the context.Inserting random transformations inside the neural architecture is made possible by differentiable implementations of these transformations 1 .
Since these random transformations can be inserted inside the neural network, we are able to generate one transformed version of Π(I) for each classifier γ k in the ensemble.This leads to a new definition for C 1 • Π: where T denotes the transformation operator and ε k the random transformation parameters drawn for γ k .As a way to generalize test-time data augmentation (Krizhevsky et al., 2012), random transformations are applied during both training and inference.

Relevant B-scan Selection
A first estimation p (1) (I) = C 1 • Π(I) of the probabilistic prediction p(I), based on the en-face 3-D→2-D projection Π, is now available.We propose to investigate further those B-scans of I which contribute the most to p (1) (I).The idea is to find additional evidence to increase or decrease the confidence in this first estimation.
To detect the B-scans that contribute the most to p (1) (I), we propose to use attribution methods presented in section 2.2.Note that attributions are computed for one particular output prediction p (1) n (I), i.e., for one DR severity cutoff.This aligns with our goal to collect additional evidence for each prediction: we will select one B-scan per prediction.As for the inputs, we can either: • use the 3-D preprocessed acquisition I and accumulate voxel-wise attributions in the xy-plane.
• or use the 2-D projection Π(I) and accumulate pixel-wise attribution along the x-axis (the fast scanning axis).
The second option was chosen for faster computations.Let a I (x, z, c, n) denote the attribution of pixel (x, z), in the c-th channel, for the n-th prediction.A normalized attribution α I (z, n) is defined for the z-th B-scan, with respect to the n-th prediction: Let z n denote the index of the n-th selected B-scan and let B n (I) denote its content: For inference, the B-scans maximizing α I (z, n), n = 1..N, are selected.However, for data augmentation purposes and to favor exploration, a random B-scan selection process is preferred 1 https://pytorch.org/vision/main/transforms.html during training: z n is randomly drawn from the multinomial probability distribution defined by α I (z, n): It should be noted that B-scan selection is not impacted by random transformations from section 3.4.2,which are an integral part of the C 1 classifier on which the attribution method operates (see Eq. ( 4)).
3.6.Final Classification 3.6.1.Second Classifier Like classifier C 1 , the second classifier C 2 also requires data augmentation.Besides the random selection process described above, we propose to apply the same random transformation T as for classifier C 1 .More generally, we define C 2 very similarly to C 1 : an ensemble of classifiers γ ′ k , k = 1..K, with random transformations (parameters: Because their input images are of a different nature, no parameter sharing was set up between C 1 and C 2 . By design, the n-th selected B-scan B n (I) is meant to correct the confidence in the n-th prediction.Therefore, we only consider the n-th prediction γ ′ k,n (B n (I)) of classifier γ ′ for B-scan B n (I).This leads to the following expression for the predictions p (2) (I) of C 2 :

Final Classifier
The second classifier C 2 is supposed to increase or decrease the confidence in the predictions of the first classifier C 1 .Therefore, the logits from both classifiers are combined linearly to obtain the final probabilistic prediction: where σ −1 is the logit function.

Training
The multi-label classifier thus defined is trained to minimize the binary cross-entropy L between network predictions p n (I) and ground truth labels λ n (I), n = 1..N: Thanks to this loss function, the ordered nature of DR severity grades is taken into account (see section 3.1).If a prediction is wrong by one severity level, then only one binary classifier will have an incorrect prediction.Whereas if a prediction is wrong by n > 1 severity levels, then n binary classifiers will have an incorrect prediction, and will therefore impact the global loss L more.
We hypothesize that B-scan selection is most relevant when the first classifier C 1 is already well trained.Therefore, two training scenarios are investigated: Two-step training: C 1 is trained alone until convergence.
Then, its parameters are frozen, and C 2 is trained until convergence.
One-step training: C 1 and C 2 are trained jointly until convergence.
This concludes the presentation of the proposed framework.

Experiments
This framework is now evaluated for the task of automated DR severity assessment, according to the ICDR scale (Wilkinson et al., 2003), using OCTA.

Dataset
In this study, we used OCTA images from the " Évaluation Intelligente de la Rétinopathie diabétique" (EviRed) project2 , which comprises data collected between 2018 and 2022 from 14 hospitals and recruitment centers in France.The PLEX Elite 9000, with a scanning frequency of 200 kHz, was employed to capture Swept-Source (SS) OCTA images.The ocular data in the EviRed dataset typically include two acquisition types: high-resolution 6x3x6 mm3 SS-OCTA (500x1536x500 voxels) centered on the macula and lower-resolution 15x6x15 mm 3 ultra-widefield UWF-SS-OCTA (834x3072x834 voxels).Each OCTA volume contains 2-D en-face localizer, structural, and flow information.The EviRed dataset encompasses 811 eyes from 477 patients and is divided into training, validation, and test sets.It should be noted that, for a few eyes, we have only 6x6 mm 2 or 15x15 mm 2 acquisitions.The distribution of patients and images in each set is presented in Table 1.
The partitioning of the EviRed dataset into training, validation, and test sets followed a systematic approach to ensure a robust evaluation of the models.The process was guided by the following criteria: 1. Patient independence: To minimize the risk of data leakage, each patient's data was assigned to only one of the sets (training, validation, or test).This approach prevents the model from learning any patient-specific characteristics that could lead to overfitting or an inflated performance metric.2. Balanced distribution of disease severity: The dataset was divided in such a way that each set contained a similar proportion of images representing various stages of DR.
This balanced distribution ensures that the model is exposed to a wide range of severity levels during the training process and provides a more accurate assessment of its performance during validation and testing.3. Stratified sampling: To maintain consistency in the demographic characteristics and other factors, stratified sampling was employed when dividing the dataset.This approach not only ensures that each set contains a representative sample of the entire dataset but also respects the original class distribution in each split.By mirroring the class distribution of the whole dataset within each subset, we further reduce the risk of biased performance evaluation.Hence, we achieve a balanced representation of demographic characteristics and other factors across all subsets.This method gives us the confidence that the inferences drawn from our study will be robust and reliable.Because 6x6mm 2 and 15x15mm 2 acquisitions have different sizes and resolutions, distinct models were built for these two types of acquisitions.

Implementation
The proposed framework was implemented using PyTorch 3 Ignite for training and inference, MONAI4 for 3-D data handling, PyTorch Torchvision for differentiable data augmentation, PyTorch Image Models (timm)5 for 2-D image classification and Captum6 for attribution methods.Experiments were performed using two NVIDIA V100 GPUs (with 32 Gb of RAM each).One of the GPUs was dedicated to 3-D→2-D projection (Π), and the other one was dedicated to 2-D image classification (C 1 and C 2 ).

Hyperparameter Optimization
Various hyperparameters need to be set: • Architecture parameters: the use of C 1 alone or the joint use of C 1 and C 2 (see section 3.1).
• Preprocessing parameters: the depth Y 0 at which the ILM is aligned and the depth Y 1 under which the volume is cropped (see section 3.2).
• 3-D→2-D projection parameters: the number Φ of filters per convolutional layer in the first block and the use, or not, of skip-connections within blocks (see section 3.3).
• Data augmentation parameters: the range of values for random affine transformations ε and ε ′ (see sections 3.4 and 3.6.1).
• Ensemble parameters: the list of K classification backbones in C 1 and C 2 , and the use of logit or probability averaging (see sections 3.4 and 3.6.1).
• B-scan selection parameters: the attribution method used (see section 3.5).
• Training parameters: the general training parameters (optimizer, learning rate, etc.), training from scratch or the use of ImageNet pre-trained weights, and the problemspecific training schedule (one-step or two-steps -see section 3.6.3).
The preprocessing and data augmentation parameters were set empirically through visual inspection: Y 0 = 32 voxels, Y 1 = 224 voxels, random rotation in the range [−10; +10] degrees, random translation in the range [−10; +10] percent of X and Y 1 or Z, random scaling in the range [90; 110] percent of X and Y 1 or Z.The following parameters were restricted due to GPU memory limitations and computation time considerations: • The number Φ of filters was limited to Φ ≤ 32 for 6x6mm 2 acquisitions, Φ ≤ 16 for 15x15mm 2 acquisitions.
• Integrated Gradients and perturbation-based attribution methods could not be used during training.The following attribution methods were investigated: Saliency, Deconvolution, Guided Backprop, and DeepLIFT.
The classifier backbones were chosen among CNNs: one advantage of most CNNs is that they can be applied to images of any size (in our case: 500 × 500 or 834 × 834 pixels) without adaptation.Two ensembles of classifiers were considered, with the following pre-trained weights (unless training from scratch is experimented): Ensemble 1: The first set of classifiers was chosen from a single family, namely EfficientNet (Tan and Le, 2019).This family is popular for its good trade-off between complexity and performance.The K = 4 simplest networks were selected: namely EfficientNet-{B0, B1, B2, B3}, pre-trained on ImageNet-1K.This first ensemble was intended for quick experiments, hyper-parameter optimization, etc.
The values of the other hyperparameters were then optimized using ensemble 1 for 6x6mm 2 acquisitions.Optimization relied on a Receiver Operating Characteristic (ROC) analysis in the validation set: hyperparameter values maximizing the Area Under the ROC Curve (AUC), averaged over the N = 4 binary classification tasks, were selected.The following hyperparameters were obtained: • Full architecture (C 1 and C 2 ).
• Φ = 8 filters in the first block (and therefore 2Φ = 16 filters in the second and 2 2 Φ = 32 filters in the third), without skip-connections.
• Adam optimizer with an initial learning rate of 10 −3 and an exponential learning rate scheduler (99% multiplicative decay at every epoch), using pre-trained weights.

Analysis of 3-D→2-D Projections
Examples of 3-D→2-D projections obtained for 6x6mm 2 acquisitions using both ensembles of classifiers are presented in Fig. 6.They are compared with projections obtained using the U-Net-like projection (or U-Projection) proposed by Lachinov et al. (2021) for 3-D→2-D segmentation, and also with projections obtained without model dropout.Examples of 3-D→2-D projections obtained for 15x15mm 2 acquisitions of the same eyes are presented in Fig. 7. Next, we present in Fig. 9 examples of A-scan-level attributions (see Eq. ( 5)) obtained for C 1 with various attribution methods.Attributions were obtained for 6x6mm 2 acquisitions using ensemble 2. This figure illustrates that the Guided Backprop, DeepLIFT, and Integrated Gradients methods produce rather similar results.The Saliency and Deconvolution methods produce more noisy results.

Performance evaluation
Classification performance achieved in the test set for 6x6mm 2 and 15x15mm 2 acquisitions is presented in Table 2 (a).Results are reported for both ensembles and all N = 4 classification tasks.For a given ensemble and a given classification task, three AUC values are reported: the one obtained for the full classifier (predictions p(I)) and the ones obtained for each branch C 1 and C 2 of the classifier separately (predictions p (1) (I) and p (2) (I), respectively).ROC curves for the full classifiers are reported in Fig. 10.
To show the relevance of the second branch (classifier C 2 ), we compared the AUC values obtained using 1) C 1 alone, 2) C 2 alone, or 3) C 1 and C 2 jointly.For each of these 3 scenarios, a set of 16 AUC values is available in Table 2: Performance of the proposed framework in terms of area under the ROC curve (AUC) in the test set (a). Impact of the classification branches on performance (b).* Significant difference, according to a Wilcoxon signed-rank test on the paired AUC values (p < 0.05).

Ablation Study
Additional experiments were performed with 6x6mm 2 acquisitions, using ensemble 1, to study the impact of the main hyperparameters.Results of experiments investigating the first classification branch C 1 only are reported in Table 3, and those investigating the second branch are reported in Table 4. Please note that results are reported on the test, not on the validation set (used for hyperparameter optimization), so the ranking of solutions may contradict the optimal hyperparameter values listed in section 4.3 (associated with test results reported in Table 2).
In this ablation study, we only considered one acquisition and one CNN ensemble.With only four pairs of AUC values to compare, the previous statistical test is no longer suitable.Instead of comparing AUC values, we thus compared ROC curves directly: Delong tests were used for that purpose.Delong tests are designed for comparing two curves, not two sets of curves: each set of curves was thus micro-averaged (Aguilar-Ruiz and Michalak, 2022) prior to Delong testing.7)).* Significant difference with the reference (first line), according to a Delong test on the micro-averaged ROC curves (p < 0.05).

Comparison with a 3-D baseline
The proposed DISCOVER framework was also compared to a 3-D baseline model in terms of classification performance and inference times.The baseline model is a 3-D CNN processing the 3-channel 3-D preprocessed acquisition I, obtained as presented in section 3.2.We ensured that the same splits were used for training, validation, and testing.Various backbones and hyperparameters were evaluated.The selection process, consistent with the one applied to our proposed method, involved choosing the optimal model based on average AUCs and the best checkpoint for each severity cut-off on the validation set.The most favorable results for the baseline were achieved using a 3-D ResNet50 (He et al., 2016;Hara et al., 2018) for both 6x6mm 2 and 15x15mm 2 acquisitions.This baseline and the proposed framework are compared in Table 5; For the proposed framework, ensemble 1 was used for 6x6mm 2 acquisitions and ensemble 2 was used for 15x15mm 2 acquisitions, as these are the best ensembles on the validation sets.Classification performances are compared using both Delong tests and a Wilcoxon signed-rank test.

Discussion
We have presented a general 3-D image classification framework, which combines a trainable 3-D→2-D en-face projection step, followed by a 2-D en-face image classification step.It is further complemented by an auxiliary branch that extracts key 2-D cross-sectional slices (B-scans) and classifies them.The main purpose is to summarize 3-D information by complementary 2-D images (en-face and cross-sectional), for improved interpretability.This novel framework was applied to automated DR severity assessment using 3-D Optical Coherence Tomography Angiography (OCTA) acquisitions.This work aligns with our previous works on interpretable or explainable DR severity assessment (Quellec et al., 2017(Quellec et al., , 2021)).But unlike these previous works, which operated on 2-D Color Fundus Photographs (CFP), this work is applied to 3-D OCTA acquisitions.The additional dimension complicates visual feedback to human readers: the relevant information needs to be summarized so that it can be displayed on a 2-D screen or printed in a report.The proposed framework mimics and generalizes how ophthalmologists analyze OCTA acquisitions: en-face projections are often used to inspect the blood flow; cross-sectional views are often used to inspect structural anomalies.
Besides improved interpretability, by design, we show that this framework guarantees improved classification performance, compared to direct 3-D image classification.This can be explained by the large volume of information in OCTA acquisitions, which complex 3-D neural architectures cannot process efficiently with limited dataset sizes and limited GPU memory capacities.Summarizing also helps for this purpose.In particular, it allows access to a large collection of highly efficient 2-D neural architectures, with ImageNet pre-trained weights.

Analysis of 2-D Projections
Two types of OCTA acquisitions were analyzed in this study: high-resolution acquisitions centered on the macula (6x6mm 2 ) and lower-resolution ultra-widefield acquisitions (15x15mm 2 ).Fig. 6 shows that normal and pathological retinal features are highly visible in en-face projections for 6x6mm 2 acquisitions (columns 3 and 4).This figure highlights the benefit of processing each A-scan independently through 1-D convolutions: all details are lost through U-Net-like projections (U-Projection in column 5).The last column also advocates the joint training of multiple 2-D classifiers through the proposed "model dropout" mechanism: by training a single classifier (or static ensemble of classifiers), relevant details may be lost, as the unique classifier may focus on a subset of discriminant features and let the 3-D→2-D projector ignore the others, while still obtaining good classification performance (see Table 3).
Through comparison of Fig. 6 and 7, it appears that 15x15mm 2 en-face projections offer a reduced level of details.This makes sense, given the reduced resolution of input images (along all three axes).It is also possible that the 3-D→2-D projector was not able to capture details equally well: the problem is not just a reduced resolution of normal and pathological retinal features, but also a reduced contrast between these features and the background.However, Fig. 7 suggests that large and peripheral pathological features are well captured in 15x15mm 2 acquisitions, explaining that advanced DR stages are detected well in those acquisitions (see Table 2).
By design, the proposed 3-D→2-D projection operator Π, which processes each A-scan independently, does not take the context of these A-scans into account.However, the classification branch C 1 does: by training C 1 • Π jointly, Π can be trained to extract information allowing localization.For instance, we can see that color in Π varies with the retinal thickness (see Fig. 6 and 7): this could be useful to capture pathological features (like macular edemas), but certainly also to localize them relatively to the normal retinal features.Indeed, the clinicians collaborating in this study affirm that this method markedly enhances the visibility of retinal lesions, as shown in Fig. 8.As it is apparent, pathological features are well preserved and highlighted, while additional details from the B-scans complement the classifier, reinforcing the need for automated B-scan selection.
To validate the usefulness of proposed 3-D→2-D projections for decision support, the next step will be to compare the classification performance of clinicians when they use these projections for decision support versus when they do not.Although it is clear that the relevant abnormal structures stand out well in these images, the color-coding of retinal features derived from these projections may be disturbing for some clinicians.For instance, Fig. 6 leads to unusual nebula-like images using ensemble 2 (column 4).For such a validation study, it may be useful to reorder color channels before human inspection, for instance, to produce more conventional-looking images.However, these considerations are out of the scope of an artificial intelligence paper.
One important property illustrated in Fig. 6 and 7 is that the same features (normal anatomical features like blood vessels, or pathological features like DR lesions) have consistent appearances in 2-D projections across patients (i.e., across lines of these figures).However, the appearance of these features clearly depends on the architecture of the projection operator (i.e., their appearance varies across columns of these figures).Similarly, their appearance depends on the weights trained for the projection operator.If one decides to retrain or fine-tune the proposed framework on a larger dataset, it may be beneficial to freeze the projection weights and only retrain the classification ensembles.This will ensure that the proposed 2-D visualization remains standardized.

Detailed Analysis of the Framework
Tab. 2 demonstrates that the first classification branch (C 1 ), which analyzes the 3-D→2-D projection, contributes the most to the final classification.This suggests that the proposed 3-D→2-D projections contain most of the discriminant information contained in 3-D OCTA acquisitions for the target classification task, which is good news for interpretability purposes: one single 2-D image conveys most of the relevant information.However, we have shown that combining the two branches leads to a significant increase in classification performance (p = 0.00015, see Table 2).The superiority of branch 1 is particularly true in the two-step training schedule, which was adopted: at first, branch 1 is trained independently, to maximize classification performance; branch 2 (C 2 ), which analyzes selected B-scans, is only trained afterward, to be complementary to the frozen branch 1.In other words, branch 2 was not trained to be discriminant on its own.An ablation study was performed to analyze the benefits of most methodological choices in the proposed design.Table 3 suggests that classification performance is impacted by the complexity of the projector, driven by parameter Φ, but no significant difference was found with more than Φ = 8.Next, the U-Net-like projector and the absence of model dropout, which clearly affect the quality of en-face projections, also decrease classification performance significantly.The use of ImageNet pre-trained weights and logit averaging also proved beneficial.Table 4 suggests that the performance of the second classification branch is little dependent on the attribution method used.As for the random selection of B-scans during training, it seems beneficial, but no significant difference was observed.The most influential parameter in Table 4 is the choice between one-step and two-step training (p=0.0004).The two-step approach has the advantage of more stable training; in particular, one instance of training divergence was observed with the one-step approach.This may explain the increased performance with two steps.On the downside, two-step training implies longer training times (by a factor of two, approximately).Therefore, we recommend investigating both approaches when replicating these results.

Comparison with Previous Algorithms
Although DR severity assessment using OCTA is a recent topic, a few classification results have already been published, using a variety of DR severity cutoffs; those results are reported in Tab. 6.It should be noted that they were generally obtained on small datasets and that data collection often included a data selection process based on image quality.For instance, Ryu et al. (2021) imaged 496 eyes, but only 360 scans were retained for further analysis, indicating a rejection rate of 27% (or more, if some patients were imaged more than once).The number of images rejected for quality reasons is not always mentioned: for instance, Le et al. (2020) only indicated a quality threshold.These factors make comparisons between algorithms challenging.However, it appears that the proposed framework allows similar or better classification results than previously published algorithms, regardless of the type of analysis (2-D CNN, 3-D CNN, or radiomics).Two tasks are particularly well addressed by our framework: ≥ mild NPDR detection (AUC = 0.958) and ≥ PDR detection (AUC = 0.957).It should be noted that, besides results from our team (Li et al., 2023), on a subset of the presented dataset, this is the first publication about ultrawidefield OCTA acquisitions (15x15mm 2 ) on the topic.This type of acquisition seems promising for detecting advanced DR stages.In order to thoroughly assess the performance of our proposed method, we have also conducted a direct comparison with a baseline 3-D CNN model, on the same test set (see section 4.7).We have demonstrated the superiority of the proposed DIS-COVER framework over that baseline, both in terms of classification performance and in terms of inference times (see Table 5).In detail, when comparing ROC curves using Delong tests, a significant difference was found for 15x15mm 2 acquisitions, but not for 6x6mm 2 acquisitions (p=0.1164).However, when looking solely at the AUC, using a Wilcoxon signed-rank test, a Figure 10: Receiver Operating Characteristic of the proposed system in the test set using EfficientNet-{B0, B1, B2, B3} (ensemble or {ConvNeXt-base, Efficient-v2, SEResNet-152} (ensemble 2) for 6×6mm 2 and 15×15mm 2 acquisitions.significant difference was found overall in favor of DISCOVER.As for inference times, they are up to 20 times faster using DIS-COVER.We believe that our proposed pipeline, which employs a 3-D→2-D summarization in conjunction with a 2-D classification, surpasses the performance of direct 3-D classification models, such as the baseline, for several reasons: 1.By incorporating the strengths of both en-face projections and cross-sectional slices, our method captures more pertinent information from the OCTA volumes.2. Our method features a lighter architecture compared to 3-D neural networks, resulting in a reduced propensity for overfitting and increased adaptability to smaller datasets.The utilization of pretrained 2-D backbones further bolsters this advantage.3. The end-to-end training approach allows our pipeline to acquire more discriminative features, thereby enhancing DR severity assessment.

Limitations and Future Works
This study has a few limitations.First, due to long training times and high resource consumption, no cross-validation or advanced hyperparameter optimization strategy (like Bayesian optimization) was adopted, so chances are that hyperparameter optimization is suboptimal.Thus, in the test set, we observe a better performance with Φ = 32 initial filters for two classification labels (≥ severe NPDR and ≥ PDR), compared to the hyperparameter value Φ = 8 selected in the validation set (see Table 3).A second limitation is that the dataset is limited in size, which impacts both training and performance evaluation.A third limitation is the use, for training and evaluation, of a DR severity scale known to be suboptimal for DR management due to its limited predictive power.
Addressing the latter two points is the purpose of the Evired project 8 : we are collecting longitudinal data from thousands of diabetic patients to define a more predictive DR severity scale.The end goal will be to predict the advent of two DR complications in the following year: proliferative DR and DME.Therefore, ultimately, the proposed framework will be trained to solve a novel 2-label prognosis task (N = 2).The inter-ing interests: Pierre Deman, Employee (ADCIS), Laurent Borderie, Employee (Evolucare Technologies), Hugang Ren and Niranchana Mannivanan, Employees (Carl Zeiss Meditec), Béatrice Cochener and Ramin Tadayoni, Consultant (Carl Zeiss Meditec).
generated high-quality 2-D projections from lowquality 3-D fluorescence microscopy images, which can be acquired fast.In biology, Haertter et al. (2022) projected curved 2-D manifolds from 3-D microscopy image stacks on a 2-D plane.Solutions to this problem are supervised: Haertter et al. (2022) use a U-Net-like structure, Forsgren et al. (2022) use conditional GANs with a U-Net-like backbone.

Figure 1 :
Figure 1: Geometry of an Optical Coherence Tomography (OCT) acquisition.A 3-D B-scan consists of multiple 2-D B-scans, which in turn consist of multiple 1-D A-scans.

Figure 2 :
Figure 2: Overview of the proposed approach.A multi-channel 3-D volume is summarized as a 2-D image through a 3-D→2-D projection network, detailed in Fig. 4. Next, a first classification branch classifies this 2-D summary image in order to produce a DR diagnosis.Through an attribution method, the most relevant 2-D B-scans are selected.Then, a second classification branch classifies the selected B-scans to improve the DR diagnosis.Each classification branch is an ensemble of classifiers, detailed in Fig. 5.In this figure, each 3-D channel in the input volume is represented by its Maximum Intensity Projection (MIP).
Figure 3: Preprocessing pipeline for OCTA acquisitions (see section 3.2) illustrated on one B-scan.Each original 2-D flow and structure B-scan is flattened, masked out, and cropped.The original 2-D LSO en-face localizer is transformed into a 3-D volume by duplicating pixel intensities along the depth axis (within the masked region).A 3-channel 3-D volume is obtained by stacking the resulting three volumes (flow, structure, LSO).

Figure 4 :Figure 5 :
Figure 4: Architecture of the 3-D → 2-D projection network, detailed in section 3.3.The input is the preprocessed acquisition, a 3-channel volume of X × Y 1 × Z voxels, with Y 1 = 224 in this example.Parameter Φ, the number of filters in the first block, controls the complexity of the projection network.The figure on the right illustrates the size of the data tensors at the output of each block.

Figure 6 :Figure 7 :Figure 8 :
Figure 6: Examples of projections for 6x6mm 2 acquisitions from the test set.The projector has Φ = 8 filters in the first block and the classifiers are EfficientNet-{B0, B1, B2, B3} for ensemble 1 and {ConvNeXt-base, Efficient-v2, SEResNet-152} for ensemble 2. The first row represents an eye graded as moderate NPDR.The second row represents an eye graded as moderate NPDR with macular edema.The third row represents an eye graded as proliferative DR.

Table 1 :
Dataset distribution among training, validation, and test sets for PLEX Elite 6x6mm 2 and PLEX Elite 15x15mm 2 Table 2 (a)(2 acquisitions × 2 CNN ensembles × 4 decisions).Wilcoxon signed-rank tests were performed to compare two scenarios by confronting the corresponding 16 pairs of AUC values: results are reported in Table 2 (b).

Table 3 :
Lachinov et al. (2021)yperparameters on the area under the ROC curve (AUC) in the test set -analysis of the first classification branch C 1 only (no B-scan selection).Experiments were performed for 6×6mm 2 acquisitions and the EfficientNet-{B0, B1, B2, B3} backbones (ensemble 1).The first line corresponds to the optimal values (based on experiments on the validation set).Investigated parameters are in bold.U-Projection denotes the U-Net-like projection proposed byLachinov et al. (2021).* Significant difference with the reference (first line), according to a Delong test on the micro-averaged ROC curves (p < 0.05).

Table 4 :
Influence of various hyperparameters on the AUC in the test set -analysis of the second classification branch C 2 (jointly with C 1 ).Experiments were performed for 6×6mm 2 acquisitions and the EfficientNet-{B0, B1, B2, B3} backbones (ensemble 1).The first line corresponds to the optimal values (based on experiments on the validation set).Investigated parameters are in bold.Argmax B-scan selection refers to a deterministic selection of B-scans during training, like during inference (see Eq. (

Table 5 :
Comparison between the proposed DISCOVER framework and a 3-D CNN baseline in terms of classification performance in the test set (AUC) and in terms of inference times.Inference times are given in seconds per volume, excluding preprocessing (see section 3.2), which is common for both frameworks.For information, inference times are 0.130 seconds/volume for ensemble 2 on 6x6mm 2 acquisitions and 0.187 seconds/volume for ensemble 1 on 15x15mm 2 acquisitions.* Significant difference between the baseline and the proposed approach, according to a Delong test on the micro-averaged ROC curves or according to a Wilcoxon signed-rank test on the paired AUC values (p < 0.05).