Deep Learning in Brain MRI: Effect of Data Leakage Due to Slice-level Split Using 2D Convolutional Neural Networks

In recent years, 2D convolutional neural networks (CNNs) have been extensively used for the diagnosis of neurological diseases from magnetic resonance imaging (MRI) data due to their potential to discern subtle and intricate patterns. Despite the high performance reported in numerous studies, developing CNN models with good generalization abilities is still a challenging task due to possible data leakage introduced during cross-validation (CV). In this study, we quantitatively assessed the effect of data leakage caused by splitting 3D MRI data at the 2D slice level, using three 2D CNN models for the classification of patients with Alzheimer's disease (AD) and Parkinson's disease (PD). Our experiments showed that slice-level CV erroneously boosted the average slice-level accuracy on the test set by 30% on the Open Access Series of Imaging Studies (OASIS), 29% on the Alzheimer's Disease Neuroimaging Initiative (ADNI), 48% on the Parkinson's Progression Markers Initiative (PPMI) and 55% on a local de-novo PD Versilia dataset. Further tests on a randomly labeled OASIS-derived dataset produced about 96% (erroneous) accuracy with a slice-level split and 50% accuracy with a subject-level split, as expected for a randomized experiment. Overall, the effect of an erroneous slice-based CV is severe, especially for small datasets.

they achieved 100% accuracy on the test set. In another study, Sivaranjini and Sujatha 26 used a pre-trained 2D CNN AlexNet architecture to classify PD patients vs. healthy controls, obtaining an accuracy of 88.9%.
Although very good performance has been achieved using deep learning for the classification of neurological disorders, many challenges still need to be addressed, including the complexity and difficulty of interpreting the results due to highly nonlinear computations, non-reproducibility of the results, data/information leakage and, especially, data overfitting (see Vieira et al. 20 and Davatzikos 16 for reviews).
Overly optimistic results may be due to data leakage, a process in which information from the test data is incorporated into the learning process. While the conclusion that data leakage leads to overly optimistic results will surprise few practitioners, we believe that the extent to which this is happening in neuroimaging applications is mostly unknown, especially for small datasets. As we completed this study, we became aware of independent research by Wen et al. 27 that corroborates part of our conclusions regarding the problem of data leakage. They successfully proposed a framework for the reproducible assessment of AD classification methods. However, their architectures were not trained and tested on the smaller datasets typical of clinical practice, and they mainly employed hold-out model validation strategies rather than cross-validation (CV), which gives a better indication of how well a model performs on unseen data 28,29 . Moreover, the authors focused on illustrating the effect of data leakage on the classification of AD patients only.
Unfortunately, the problem of data leakage caused by an incorrect data split is not limited to the area of AD classification but can also be seen across various neurological disorders. Data leakage is more commonly observed in 2D architectures, yet some forms of it, such as late split, can be present in 3D CNN studies as well. Moreover, although deep, complex classifiers are more prone to overfitting, there is no reason to believe that conventional machine learning algorithms cannot be affected by data leakage. A summary of the works with clear and potential data leakage is given in Tables 1 and 2, respectively. Other works with insufficient information to assess data leakage are reported in Table 3.

In this study, we addressed the issue of data leakage in one of the most common classes of deep learning models, i.e., 2D CNNs, caused by an incorrect dataset split of 3D MRI data. Specifically, we quantified the effect of data leakage on CNN models trained on different datasets of T1-weighted brain MRI of healthy controls and patients with neurological disorders, using a nested CV scheme with two different data split strategies: a) subject-level split, avoiding any form of data leakage, and b) slice-level split, in which different slices of the same subject are contained in both the training and the test folds (thus data leakage occurs). We focused our attention on both large (about 200 subjects) and small (about 30 subjects) datasets to evaluate a possible increase in performance overestimation when a smaller dataset is used, as is often the case in clinical practice. This paper expands on the preliminary results by Yagis et al. 30 , offering a broader investigation of the issue. In particular, we performed the classification of AD patients using the following datasets: 1) OASIS-200, consisting of 100 AD patients and 100 healthy controls randomly sampled from the OASIS-1 study 31

Results
For AD classification, accuracies on the test set using subject-level CV were below 71% for large datasets (OASIS-200 and ADNI), whereas they were below 59% for smaller datasets (OASIS-34). As regards de-novo PD classification, they were around 50% for both large (PPMI) and small (Versilia) datasets. Conversely, on all datasets, slice-level CV erroneously produced very high classification accuracies on the test set (higher than 94% and 92% on large and small datasets, respectively), leading to deceptive, over-optimistic results (Table 4).
The worst case stemmed from the randomly labeled OASIS dataset, which resulted in a model with unacceptably high performance (accuracy on the test set above 93%) using slice-level CV, whereas classification results obtained using subject-level CV were about 50%, in accordance with the expected outcome for a balanced dataset with completely random labels.

Table 4. Mean slice-level accuracy on the training and test sets of the outer CV over 5-fold nested CV is reported for the three 2D CNN models (see "Methods" section), all datasets and the two data split methods (slice-level and subject-level). The difference between the accuracy on the test set using the slice-level and the subject-level split is also reported.

Discussion
In this study, we quantitatively assessed the extent of the overestimation of a model's classification performance caused by the adoption of an incorrect slice-level CV, which is unfortunately frequent in the neuroimaging literature (see Tables 1-3). More specifically, we showed the performance of three 2D CNN models (two VGG variants and one ResNet-18, see "Methods" section) trained with subject-level and slice-level CV data splits for the classification of AD and PD patients vs. healthy controls using T1-weighted brain MRI data. Our results revealed that pooling the MRI slices of all subjects and then dividing them randomly into training and test sets leads to significantly inflated accuracies (in some cases from barely above chance level to about 99%). In particular, slice-level CV erroneously increased the average slice-level accuracy on the test set by 40-55% on the smaller datasets (OASIS-34 and Versilia) and by 25-45% on the larger datasets (OASIS-200, ADNI, PPMI). Moreover, we conducted an additional experiment in which all the subjects' labels were fully randomized (OASIS-random dataset). Even under these circumstances, using the slice-level split we achieved an erroneous 95% classification accuracy on the test set with all models, whereas we found 50% accuracy using a subject-level data split, as expected for a randomized experiment. This large (and erroneous) increase in performance is likely due to the high intra-subject correlation among T1-weighted slices, resulting in similar information content across slices of the same subject 34 .
In AD classification, three previous studies 21,22,35 using similar deep networks (VGG16, ResNet-18 and LeNet-5, respectively) reported higher classification accuracies (92.3%, 98.0% and 96.8%, respectively) than ours. However, there is a strong indication that these performances are massively overestimated due to a slice-level split. In particular, in one of these works 21 , the presence of data leakage was further corroborated by the source code accompanying the paper and confirmed by our data. In fact, when we used the same dataset as Hon and Khan 21 (the OASIS-200 dataset), our VGG16 models achieved only 66% classification accuracy with a subject-level split, whereas accuracy was boosted to about 97% with a slice-level split. Similar findings were presented by Wen et al. 27 , who used an ADNI dataset with 330 healthy controls and 336 AD patients. Indeed, using baseline data, they reported 79% balanced accuracy on the validation set with a subject-level split, which increased up to 100% with a slice-level split.
One of the main issues in the classification of neurological disorders using deep learning is data scarcity 36 , not only because labeling is expensive, but also because privacy concerns and institutional policies make acquiring large sets of labeled imaging data even more challenging 37 . To show the impact of data size on model performance, we created 10 small subsets from the OASIS dataset (the OASIS-34 datasets). As expected, when we reduced the data, we obtained lower classification accuracies with all the networks using the subject-level data split. However, when the slice-level split was used, the models erroneously achieved better results on OASIS-34 than on the OASIS-200 dataset. Similarly, models trained on the Versilia dataset (34 subjects) produced inflated results with the slice-level split. Overall, these results indicate that data leakage is especially relevant when small datasets are used, a situation which may unfortunately be common in clinical practice.
It is well known that data leakage inflates performance. Nevertheless, the degree of overestimation quantified through our experiments was surprising. Unfortunately, in the literature, the precise application of CV is frequently not well documented and the source code is not available. This situation leaves the neuroimaging community unable to trust the (sometimes) promising results published. Regardless of the network architecture, the number of subjects, and the complexity of the classification problem, all experiments that applied slice-level CV yielded very high classification accuracies on the test set as a result of incorporating different slices of the same subject in both the training and test sets. For classifications on 2D MRI images, we showed that, to prevent data leakage and obtain trustworthy results, it is crucial that the CV split be done at the subject level. This ensures that the training and test sets are completely independent and that no information leaks from the test set into the training set during the development of the model. With recent advances in machine learning, more and more people are becoming interested in applying these techniques to biomedical imaging, and there is a real and growing risk that many of them will not be familiar with the possible pitfalls and good practices. We also emphasize the need to document how the CV is implemented, the architecture used, and how the different hyperparameter choices/tunings are made, including their values where possible. Furthermore, source code should be made available to the neuroimaging community so that results are reproducible 38 .
The main limitation of this study is that we have not assessed all types of data leakage, including late split and hyperparameter optimization on the test set, both of which may also be present in 3D CNN studies. A late split occurs when the data augmentation step is performed before separating the test set from the training data. In that case, augmented data generated from the same original image can be seen in both the training and test data, leading to inflated performance 27 . Likewise, using the same test set both for optimizing the training hyperparameters and for evaluating the model performance is an additional form of data leakage 39 . Finally, data leakage also occurs when feature selection is performed on the whole dataset before carrying out cross-validation 39,40 . We have found evidence of all these data leakage issues in the recent literature (see Tables 1-3) and we plan to systematically quantify their effect in future work. Another limitation may be the substantial overfitting we observed. Indeed, using more efficient models, obtained by searching for the best architectures and exploring a wider hyperparameter space, better performance on the test set would have been achieved, and hence the boost in performance due to the slice-level split would likely be reduced.
In conclusion, training a 2D CNN model on 3D brain image data should be performed using subject-level CV to prevent data leakage. The adoption of slice-level CV results in overly optimistic model performance, especially for small datasets, as the extent of the overestimation due to data leakage is severe.

Datasets
In this study, we adopted scans collected by three public, international datasets of T1-weighted images of patients with AD (the OASIS dataset 31 and the ADNI dataset 32 ) and de-novo PD (the PPMI dataset 33 ). An additional private de-novo PD dataset, namely the Versilia dataset, has also been used. A summary of the demographics of the datasets used in this study is shown in Table 5. In the following sections, a detailed description of all datasets is reported.

In OASIS-1, the AD diagnosis, as well as the severity of the disease, was evaluated based on the global Clinical Dementia Rating (CDR) score, derived from individual CDR scores for the domains of memory, orientation, judgment and problem solving, function in community affairs, home and hobbies, and personal care 41,42 . Subjects with a global CDR score of 0 were labeled as healthy controls, while those with scores of 0.5 (very mild), 1 (mild), 2 (moderate), and 3 (severe) were all labeled as AD.
All T1-weighted images have been acquired on a 1.

PPMI dataset
We randomly selected 100 de-novo PD subjects (40 women and 60 men, age 61.71 ± 9.99 years, mean ± SD) and 100 healthy controls (36 women and 64 men, age 61.91 ± 11.52 years, mean ± SD) from the publicly available PPMI dataset (https://ida.loni.usc.edu/login.jsp?project=PPMI). No significant difference in age (p = 0.44, t-test) or gender (p = 0.56, χ2-test) was found between the two groups. The criteria used to recruit de-novo PD patients and healthy controls were defined by Marek and colleagues 33 . Briefly, PD patients were selected within two years of diagnosis, with a Hoehn and Yahr score < 3 43 , with at least two of resting tremor, bradykinesia and rigidity (must have either resting tremor or bradykinesia) or a single asymmetric resting tremor or asymmetric bradykinesia, and with dopamine transporter (DAT) or vesicular monoamine transporter type 2 (VMAT-2) imaging showing a dopaminergic deficit. Healthy controls were free from any clinically significant neurological disorder 33 .
The T1-weighted scans were collected at baseline using MR scanners manufactured by Siemens (11 scanners at 3 T and 5 scanners at 1.5 T), Philips Medical Systems (10 scanners at 3 T and 11 scanners at 1.5 T), GE Medical Systems (11 scanners at 3 T and 24 scanners at 1.5 T) and another, anonymous manufacturer (5 scanners at 1.5 T). We also found three subjects whose MRI protocol was missing. The details of the MRI protocols of all scanners can be found in Supplementary

Versilia dataset
Seventeen patients (4 women and 13 men, age 64 ± 7.21 years, mean ± SD) with de-novo parkinsonian syndrome, consecutively referred to a Neurology Unit for the evaluation of PD over a 24-month interval (from June 2012 to June 2014), were recruited for this dataset. More details about the clinical evaluation can be found in 44 . Seventeen healthy subjects (5 women and 12 men, age 64 ± 7 years, mean ± SD) with no history of neurological disease and a normal neurological examination were recruited as controls. No significant difference in age (p = 0.95, t-test) or gender (p = 0.70, χ2-test) was found between the two groups.
All subjects underwent high-resolution 3D T1-weighted imaging on a 1.5 T

T 1 -weighted MRI data pre-processing
All T1-weighted MRI data went through two preprocessing steps (see Fig. 1). In the first stage, co-registration to a standard template space and skull stripping were applied to geometrically re-align all the images and remove non-brain regions. In the second stage, a subset of axial images was selected using an entropy-based slice selection approach.

Co-registration to a standard template space and skull stripping
For the OASIS datasets, we used publicly available pre-processed data (gain-field corrected, brain masked, and co-registered) 45 . Briefly, the brain masks from OASIS were obtained using an atlas-registration-based method, their quality was controlled by human experts 31 , and each volume was co-registered to the Talairach and Tournoux atlas. Each pre-processed T1-weighted volume had a data matrix size of 176 × 208 × 176.

For all other datasets, we co-registered each individual T1-weighted volume to the MNI152 standard template space (at 1 mm voxel size, available in the FSL version 6.0.3 package) using the SyN algorithm included in the ANTs package (version 2.1.0) with default parameters 46 . Then, the brain mask of the standard template space was applied to each co-registered volume. Each pre-processed T1-weighted volume had a data matrix size of 182 × 218 × 182 and a voxel size of 1 mm × 1 mm × 1 mm.

Entropy-based slice selection
Each T1-weighted slice generally conveys a different amount of information. Given that we are interested in developing a 2D CNN model, we performed a preliminary slice selection based on information content, retaining, for each T1-weighted volume, only the eight axial slices with the highest entropy 21 . Specifically, for a slice with k grayscale levels, each gray level having a probability of occurrence p_k (estimated as its relative frequency in the image), the Shannon entropy E_S was computed as:

E_S = −∑_k p_k log₂ p_k  (1)

To be consistent with the input sizes of the proposed 2D CNN models, all slices were resized to 224 × 224 pixels by fitting a cubic spline over the 4-by-4 neighborhood of each pixel 47 . Voxel-wise feature standardization has also been applied to make training the CNNs easier and achieve faster convergence, i.e., from each voxel the mean of all grayscale values within the brain mask has been subtracted, and the result has been scaled by the standard deviation (within the brain mask).
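The entropy-based selection of Eq. (1) can be sketched as follows. This is an illustrative reimplementation, not the authors' code: the function names (`shannon_entropy`, `select_top_slices`) and the choice of 256 histogram bins for estimating p_k are our own assumptions.

```python
import numpy as np

def shannon_entropy(slice_2d, n_levels=256):
    # p_k: relative frequency of each gray level in the slice (Eq. 1).
    hist, _ = np.histogram(slice_2d, bins=n_levels)
    p = hist / hist.sum()
    p = p[p > 0]                      # convention: 0 * log2(0) = 0
    return float(-np.sum(p * np.log2(p)))

def select_top_slices(volume, n_slices=8, axis=2):
    # Keep only the n_slices axial slices with the highest entropy.
    entropies = [shannon_entropy(np.take(volume, i, axis=axis))
                 for i in range(volume.shape[axis])]
    top = np.sort(np.argsort(entropies)[-n_slices:])  # preserve slice order
    return volume.take(top, axis=axis)
```

A uniform slice has zero entropy, so mostly empty slices near the top and bottom of the head are naturally discarded by this criterion.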

Model architectures
Since the number of subjects in each dataset may not be sufficient to train a 2D CNN model from scratch with high accuracy, we used a machine learning technique called transfer learning, which allows employing pre-trained models, i.e., model parameters previously developed for one task (the source domain) to be transferred to a target domain for weight initialization and feature extraction. In particular, CNN layers hierarchically extract features, starting from general low-level features and progressing to those specific to the target class, and, using transfer learning, the general low-level features can be shared across tasks. Notably, in this study we used pre-trained VGG16 48 and ResNet-18 49 models, as detailed in the following sections. The transfer learning approach and VGG16 architectures used in this study are similar to those employed in 21 , as their results triggered our investigation of data leakage.

VGG16-based models
VGG16 is one of the most influential architectures; it explores network depth using very small (3 × 3) convolution filters stacked on top of each other. VGG16 consists of five convolutional blocks, with alternating convolutional and pooling layers, followed by three fully connected layers.
In transfer learning, the most common approach is to copy the first n layers of the pre-trained network to the first n layers of a target network, and then randomly initialize the remaining layers to be trained on the target task. Depending on the size of the target dataset and the number of parameters in the first n layers, the copied features can be left unchanged (i.e., frozen) or fine-tuned during the training of the network on the new dataset. It is well accepted that if the target dataset is relatively small, fine-tuning may cause overfitting, whereas if the target dataset is large, the base features can be fine-tuned to improve the performance of the model without overfitting.
To investigate the effect of fine-tuning, we tested two different variants of the VGG16 architecture, namely VGG16-v1 and VGG16-v2 (Fig. 2). The former model has been used as a feature extractor, where the weights of all layers of the network are frozen except those of the final fully connected layers. The three topmost layers have been replaced by randomly initialized fully connected layers with rectified linear unit (ReLU) activation. The weights are initialized according to the Xavier initialization heuristic 50 to prevent the gradients from vanishing or exploding.
The VGG16-v2 model has been used as a weight initializer, where the weights are derived from the pre-trained network and fine-tuned during training. We replaced the fully connected layers with a randomly initialized global average pooling (GAP) layer, as suggested by Lin and colleagues 51 , to reduce the number of parameters and, rather than freezing the CNN layers, we fine-tuned all layers.

ResNet-18 based model
It has long been believed that deeper networks can learn more complex non-linear relationships than shallower networks with the same number of neurons, and thus that network depth is of great importance for model performance 52 . However, many studies have revealed that deeper networks often converge to higher training and test error rates than their shallower counterparts 49 . Therefore, stacking more layers onto plain networks may eventually degrade the model's performance while complicating the optimization process. To overcome this issue, He and colleagues introduced deep residual neural networks, achieving top results on the popular ImageNet test set 49 . The model was proposed as an attempt to solve the vanishing gradient and degradation problems using residual blocks. With these residual blocks, the features of any deeper unit can be computed as the sum of the activation of a shallower unit and the residual function. This architecture allows the gradient to be directly propagated to shallower units, making ResNets easier to train.
There are different versions of the ResNet architecture with various numbers of layers. In this work, we used the ResNet-18 architecture, an 18-layer residual deep learning network consisting of five stages, each with a convolution and an identity block 49 . In our model, one fully connected layer with sigmoid activation has been added at the end of the network, a common practice in binary classification tasks, as it takes a real-valued input and squashes the output to a range between 0 and 1. Since the network is relatively small and has fewer parameters than VGG16, the weights and biases of all the transferred layers are fine-tuned, while the newly added fully connected layer is trained starting from randomly initialized weights. The architecture of our ResNet-18 model can be seen in Fig. 3.

Model training and validation
Each 2D CNN model has been trained and validated using a nested CV strategy, a validation scheme that allows examining the unbiased generalization performance of the trained models while, at the same time, performing hyperparameter optimization 39 . It involves nesting two k-fold CV loops, where the inner loop is used for optimizing the model hyperparameters and the outer loop gives an unbiased estimate of the performance of the best model. It is especially suitable when the amount of data available is insufficient to allow separate validation and test sets 39 . A schematic diagram of the procedure is illustrated in Supplementary Fig. S2 (Supporting Information). It starts by dividing the dataset into k folds; one fold is kept as a test set (outer CV), while the other k − 1 folds are split into inner folds (inner CV). The model hyperparameters are chosen from the hyperparameter space through a grid search based on the average performance of the model over the inner folds. In particular, we varied the learning rate in the set {10⁻⁵, 3 × 10⁻⁵, 10⁻⁴, 3 × 10⁻⁴, 10⁻³} and the learning rate decay in {0, 0.1, 0.3, 0.5}. The chosen model is then fitted on all the outer-fold training data and tested on the unseen test fold, resulting in an unbiased estimate of the model's prediction error. Specifically, we chose a 10-fold CV because it offers a favorable bias-variance tradeoff 53,54 .
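The nested CV scheme can be illustrated with scikit-learn; note that this sketch substitutes a logistic regression and a hypothetical `C` grid for the CNNs and the learning-rate/decay grid described above, since the structure of the two loops is the same regardless of the estimator:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Toy stand-in data: 200 samples with 20 features each.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

inner = KFold(n_splits=5, shuffle=True, random_state=0)   # hyperparameter tuning
outer = KFold(n_splits=10, shuffle=True, random_state=0)  # unbiased estimate

# The grid search plays the role of the learning-rate / decay grid search.
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
                      cv=inner)

# Each outer fold refits the whole inner search on its training part,
# then scores the selected model on the held-out outer test fold.
scores = cross_val_score(search, X, y, cv=outer)
print(f"outer-CV accuracy: {scores.mean():.2f} ± {scores.std():.2f}")
```

The key property is that the data of each outer test fold never influence the hyperparameter choice made inside the corresponding inner loop.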
In all experiments, we used a batch size of 128 and 50 epochs. Due to its ability to adaptively update an individual learning rate for each parameter, the Adam optimizer was used 55 . Each selected slice of the 3D T1-weighted volume has been classified independently, and the final model performance was stated as the mean slice-level accuracy, separately, on the training and test set folds of the outer CV.
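One plausible reading of this optimizer setup in PyTorch is sketched below; the per-epoch multiplicative application of the learning-rate decay is our assumption, as the text does not specify how the decay values from the grid are applied:

```python
import torch
import torch.nn as nn

def make_optimizer(model, lr=1e-4, decay=0.0):
    # Adam keeps running estimates of the gradient's first and second
    # moments and adapts an individual step size for each parameter.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    # Assumed interpretation: the decay value shrinks the learning rate
    # multiplicatively once per scheduler step (e.g., once per epoch).
    sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=1.0 - decay)
    return opt, sched
```

With `decay=0` the learning rate stays constant, matching the first value of the grid in the text.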
We thus conducted CNN model training and validation on each dataset in a nested CV loop using two different data split strategies: a) subject-level split, in which all the slices of a subject are placed either in the training set or in the test set, avoiding any form of data leakage; b) slice-level split, in which all the slices are pooled together before CV and then split randomly into training and test sets. In this case, for each slice of the test set, a set of highly correlated slices coming from the MR volume of the same subject ends up in the training set, giving rise to data leakage, as shown pictographically in Fig. 1.
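The two split strategies can be contrasted with scikit-learn on hypothetical slice-level data: `GroupKFold` implements the leakage-free subject-level split, while a plain shuffled `KFold` over the pooled slices reproduces the erroneous slice-level split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

# Hypothetical slice table: 20 subjects x 8 selected slices, one row per slice.
n_subjects, n_slices = 20, 8
subject_id = np.repeat(np.arange(n_subjects), n_slices)
X = np.random.default_rng(0).random((n_subjects * n_slices, 16))

# a) Subject-level split: GroupKFold never separates a subject's slices.
for train, test in GroupKFold(n_splits=5).split(X, groups=subject_id):
    assert set(subject_id[train]).isdisjoint(subject_id[test])

# b) Slice-level split: a plain KFold over pooled slices puts slices of the
#    same subject on both sides of the split: this is the data leakage.
train, test = next(KFold(n_splits=5, shuffle=True,
                         random_state=0).split(X))
leaked = set(subject_id[train]) & set(subject_id[test])
print(f"{len(leaked)} of {n_subjects} subjects appear in both sets")
```

With eight slices per subject, a random pooled split leaves almost every subject represented on both sides, which is exactly the condition under which the inflated accuracies reported above arise.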
CNN models were implemented as custom-made software in Python (version 3.6.8) using the following

Figure 1

Schematic diagram of the overall T1-weighted MRI data processing and validation scheme. First, a pre-processing stage includes co-registration to a standard space, skull stripping and slice selection based on an entropy calculation. Then, CNN model training and validation are performed on each dataset in a nested CV loop using two different data split strategies: a) subject-level split, in which all the slices of a subject are placed either in the training or in the test set, avoiding any form of data leakage; b) slice-level split, in which all the slices are pooled together prior to CV and then split randomly into training and test sets.

Figure 2
The two different networks based on the VGG16 architecture are shown. Each colored block of layers illustrates a series of convolutions. (a) The first model, named VGG16-v1, consists of five convolutional blocks followed by three fully connected layers; only the last three fully connected layers are fine-tuned. (b) The second model, VGG16-v2, has five convolutional blocks followed by a global average pooling layer, and all the layers are fine-tuned.