Boosting multiple sclerosis lesion segmentation through attention mechanism

Magnetic resonance imaging is a fundamental tool for diagnosing multiple sclerosis and monitoring its progression. Although several attempts have been made to segment multiple sclerosis lesions using artificial intelligence, fully automated analysis is not yet available. State-of-the-art methods rely on slight variations of segmentation architectures (e.g., U-Net). However, recent research has demonstrated that exploiting temporal-aware features and attention mechanisms can provide a significant boost to traditional architectures. This paper proposes a framework that exploits a U-Net architecture augmented with a convolutional long short-term memory layer and an attention mechanism, and that is able to segment and quantify multiple sclerosis lesions detected in magnetic resonance images. Quantitative and qualitative evaluation on challenging examples demonstrated that the method outperforms previous state-of-the-art approaches, reporting an overall Dice score of 89%, and also shows robustness and generalization ability on previously unseen test samples from a new dedicated dataset that is under construction.

Medical Image Analysis, In Silico Trials (IST)

Introduction
Multiple Sclerosis (MS) is a chronic inflammatory demyelinating disease of the Central Nervous System (CNS) [1], with neuropathologic features characterized by focal areas of inflammation with myelin and axonal loss. MS lesions may be detected in vivo by Magnetic Resonance Imaging (MRI) in different areas of the brain and the spinal cord, and they accumulate over time [2]. Selective localization of lesions on MRI (periventricular, cortical/juxtacortical, brain stem/cerebellar, and spinal cord) is also relevant for the diagnosis of MS and the detection of new or enlarging lesions at follow-up, and is routinely used in the evaluation of therapeutic response and disease progression [3,4]. Manual annotation of MS lesions on MRI scans is a time-consuming task that requires substantial effort by specialized experts. Moreover, inter- and intra-operator variability is unavoidable and may affect the accuracy and reproducibility of lesion segmentation [5]. Thus, there is an increasing interest in automating MRI reading and evaluation, to avoid the bias introduced by human raters and to make this information available for routine clinical practice [6]. Typically, the longitudinal brain MRI protocol involves distinct kinds of sequences, which generate different types of images that vary according to the contrast of the various tissues that compose the brain. The most common MRI sequences used to detect MS lesions are Fluid Attenuated Inversion Recovery (FLAIR), T1-weighted, T2-weighted, and PD-weighted images. In the T1-weighted sequence, white matter appears lighter than gray matter, and cerebrospinal fluid (CSF) appears dark. In the T2-weighted sequence, white matter appears darker than gray matter, while the CSF appears bright. FLAIR images are similar to T2-weighted images, except that the CSF is suppressed. MS lesions appear hypointense in T1-w and hyperintense in T2-w, PD-w and FLAIR sequences, with respect to normal tissue intensities.
Lesions are most detectable in FLAIR images, where they appear hyperintense and are usually well distinguishable from surrounding tissues. Figure 1 shows four brain MRI images, one for each acquisition type, with MS lesions marked by red circles. In our method, similarly to other works in the field [7,8,9], the most discriminating MRI sequence (FLAIR) was exploited.
The starting point of our study is the U-Net architecture [10], one of the most widely used networks in the state of the art, not only for medical image segmentation but also for general segmentation tasks. In this work, an extended Fully Convolutional DenseNet (FC-DenseNet) [11] for MS lesion segmentation is proposed; it follows the U-Net structure [10] with the addition of a Long Short-Term Memory (LSTM) layer and extensive usage of attention mechanisms to detect MS lesions in longitudinal FLAIR brain MRI. Attention [12] is a technique that aims to mimic human cognitive attention by forcing neural networks to pay greater attention to the most informative input data and ignore the rest. Attention mechanisms have been shown to be effective in capturing global dependencies and have become an integral part of semantic segmentation tasks [13]. The FC-DenseNet has been extended with an attention mechanism based on squeeze-and-attention blocks [14], in order to accentuate groups of pixels belonging to the same class at different spatial scales. Squeeze-and-attention (SA) blocks are components that can be easily incorporated within the backbone and that improve network performance through operations applied at both local and global level. Moreover, the observation that lesions propagate in space with a similar shape across adjacent images suggested the introduction of an LSTM layer [15], which preserves spatial information along the longitudinal axis of the data. The performance of the proposed architecture was evaluated employing a cross-validation scheme in patients with lesions on follow-up scans. The architecture described in Figure 4 is the best-performing of the tested models described in the ablation studies (Section 4.5).
The training phase of a Deep Neural Network architecture typically requires a large amount of labeled images [16]. A relevant issue in MRI lesion segmentation is that each available dataset contains only a few examples [6], and that different repositories lack homogeneity due to the usage of different scanners and/or acquisition protocols. This makes the segmentation challenging and raises concerns about the results obtained from the different methods, which are difficult to compare and to generalize to other datasets. For these reasons, we are currently working on the generation of a new labeled MRI dataset as part of the "In Silico World (ISW): Lowering barriers to ubiquitous adoption of In Silico Trials" project (Grant agreement ID: 101016503, PROGRAMME: H2020-EU.3.1. - SOCIETAL CHALLENGES - Health, demographic change and well-being, CALL: H2020-SC1-DTH-2018-2020). It will be larger than the existing ones, with heterogeneous samples (patients at different stages of disease) and with labeled MS lesions validated by different experts. The proposed method, its future extensions and the labeled dataset under construction will be included in the Universal Immune System Simulator (UISS). UISS is a multi-compartment, multiscale, polyclonal, stochastic, and patient-specific agent-based model (ABM) that is able to simulate immune system dynamics in both physiological and pathological scenarios [17]. The UISS simulator framework has been extended to model MS pathogenesis and its interaction with the host immune system [18], taking into account both cellular and molecular entities.
In particular, UISS-MS takes into account B cells, T helper (CD4+ T cells), T cytotoxic (CD8+ T cells), conventional dendritic cells (DCs), macrophages (M), plasma B cells (P cells), immunocomplexes (IC), oligodendrocytes (ODC), interferon-gamma (IFN-G), interleukins of type x (IL-x), transforming growth factor beta (TGFB), myelin basic proteins (MBP), immunoglobulins class G (IgG) and chemokines (as generic chemokines) [19]. For each modeled patient, the age at MS onset, baseline MRI lesion load, oligoclonal bands status, and the administered treatment are usually considered.
A limitation of the current UISS-MS framework is that only qualitative data about MRI lesion load have been included [18]. For this reason, the quantitative data about the MRI lesion load obtained with the framework proposed here will be integrated into the UISS framework, with the aim of representing and predicting the disease progression of MS patients as well as simulating the immune response to specific treatments more realistically.
The remainder of this paper is organized as follows. Section 2 summarizes the state of the art of MS lesion segmentation, while Section 3 describes the employed dataset and the proposed method. Experimental results with state-of-the-art comparisons and ablation studies are reported in Section 4, whereas Section 5 concludes the paper.

State of the art
In recent years, various Artificial Intelligence (AI) methods based on deep learning have been proposed for the classification, detection, and segmentation of health-related conditions from medical images. For example, [19] utilizes deep learning methods for the classification of stroke in MR images, whereas [20] compares the classification performance of several deep learning architectures on ultrasound images for the early diagnosis of carotid artery disease. Moreover, deep learning-based architectures have been utilized for the segmentation of various organs and tissues in medical images, including autoimmune disease segmentation using histopathological images [21], lung segmentation for Covid-19 prediction [22,23,24] and automatic segmentation of MS lesions from MRI scans; however, the results obtained are still far from those generated by manual segmentation. Furthermore, semi-automatic and automatic approaches have proven to be sensitive to MRI variability and to different acquisition modalities, leading to a loss of accuracy. Despite the enormous efforts, the results to date are still distant from those of human experts. MS lesion segmentation methods have been classified into several main categories, which include unsupervised, supervised and deep learning-based methods. Regarding supervised approaches, the authors of [25] proposed a method where lesions were segmented by applying an intensity threshold to the FLAIR image. In [26], a fuzzy classification method is combined with an edge-based method, and the segmentation is obtained by applying thresholding and a false-positive reduction technique. Regarding unsupervised approaches, [27] proposed an algorithm for MS lesion detection based on the intensity distribution of the three different tissues. In [28], the authors used a probabilistic model, the Gaussian Mixture Model (GMM), to delineate lesion contours.
The work in [8] proposed a framework that exploits a Bayesian classifier and a Markov Random Field (MRF) model to compute the a-priori probability for each tissue class. Most of the latest works exploit deep neural network-based methods. In particular, most of the published works employ methods based on Convolutional Neural Networks (CNNs) and U-Net architectures. In [29], the authors proposed an automated pipeline for serial analysis of MS lesions using FLAIR scans, relying on cross-sectional segmentation of lesions in white matter. In [30], a multiclass FCNN model is proposed for the segmentation of brain tissue (gray matter, white matter, and cerebrospinal fluid) and MS lesions in T2-w scans. A framework for FLAIR segmentation is proposed in [31] by training two CNNs on the MSSEG-2016 dataset in the axial, coronal, and sagittal directions. In [32] the authors use a multimodal 2D U-Net, encoding the different image modalities in separate downsampling channels, while [33] proposes a combination of 3D networks for a spatially distributed strategy that is robust to domain shift.
The use of attention has shown interesting improvements in some fields of medical image segmentation. In a recent work [34], segmentation of MRI FLAIR and T2 images is performed using a modified U-Net and an Attention U-Net, proposing the fusion of the masks obtained from the segmentation of FLAIR and T2 images. Another study [35] proposed a new dense residual U-Net model that leverages attention gate and channel attention techniques to improve the performance of automated MS lesion segmentation in MRI, while the authors of [9] propose a CNN based on a two-path architecture with the addition of an attention-driven interaction block that shares information between two different time points. These recent works demonstrate that the right use of attention in the MS domain can significantly improve the results. The authors of [6] summarize recent research on automated MS diagnosis based on Deep Learning (DL) and AI, analyzing the features exploited, the preprocessing techniques employed and the challenges faced by published works, which are in part exploited in the current proposal.

Image dataset
The dataset employed in our method is a subset of the ISBI2015 challenge dataset, a publicly available set of images presented at the Longitudinal MS Lesion Segmentation Challenge [36], organized in conjunction with the ISBI 2015 conference. The full dataset is composed of MRI scans of 19 patients, acquired at multiple time points on a 3.0 Tesla MR scanner, but segmentation masks are available for only 5 patients. Each patient has two different segmentation masks, produced by two expert human raters; it is important to note that in many cases the two masks differ, which illustrates the difficulty of the task even for MS experts. The 14 patients without segmentation masks were originally used to validate the challenge algorithms but were discarded in our case, since no ground truth is available for them. An example image from the dataset and its corresponding mask are shown in Figure 2.
The selected 5 patients were acquired at different time points: 4 of them have 4 longitudinal time points, while the last one has 5, for a total of 21 different acquisitions; the time interval between two consecutive acquisitions is approximately 1 year. Note that the course of multiple sclerosis is highly variable and follow-up scans do not necessarily correspond to disease progression, as MS lesions may appear at different times and in different parts of the brain. Each acquisition contains the original MR images, the images after co-registration (geometric alignment of the images), brain extraction and non-uniformity correction, and the masks representing the MS lesions. Each scan contains different image sequences: T1-weighted, T2-weighted, PD-weighted, and FLAIR. To assess the stability of the model, we performed our experiments by evaluating our method only on the masks labeled by rater 1. Only the FLAIR images were employed in our experiments, because in this sequence MS lesions in white matter appear hyperintense and are more visible than in other types of sequences; every FLAIR sequence is composed of 181 images of size 181 × 217.
We are currently generating a dataset containing MRI scans, acquired as part of the project "In Silico World (ISW): Lowering barriers to ubiquitous adoption of In Silico Trials", with MS lesions labeled by two experts. This dataset will contain scans of numerous patients acquired at multiple time points; MR images were acquired by a 1.5 T scanner (Ingenia, Philips MR Systems, Release 4.1.3.2, Best, The Netherlands) under a regular maintenance program, and the sequences employed to reveal MS lesions were 3D T2-FLAIR, Axial T2-FSE and 3D T1-gradient echo. An additional test of the proposed method on data from the future dataset was carried out and is presented in Section 4.4. The study was approved by the corresponding Hospital Ethics Committee and all patients gave their informed consent.

Preprocessing
State-of-the-art methods apply different image preprocessing steps, such as co-registration [37], intensity correction [38,39] and skull-stripping [40,41]. To avoid processing useless or marginal information, the black images at the terminal parts of each scan, where lesions are not present, are usually removed. For the same reason, Hashemi et al. [34] also removed the black area outside the brain, where no lesions can be found, from the images containing lesions, in order to give only "active" information to the network. Although this step may seem obvious, removing these areas improves the results significantly. Figure 3 shows an example of this preprocessing, which was applied to all the items considered in our method and reduces the input images to 160 × 160 pixels. This step also helps to mitigate the class imbalance inside the ground-truth masks (Figure 2b): our goal is to identify the lesions, which are represented by white pixels, while the rest of the image is represented by black pixels (a binary classification). In the mask image (Figure 2b) the number of white (target) pixels is evidently much smaller than the number of black ones, which pushes the model to favor the dominant class. Removing the entirely black masks and resizing the images also has the side effect of significantly reducing the overall training time.
After the above-mentioned preprocessing steps, we obtain images with corresponding square masks of size 160 × 160, which constitute the input to the network.
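The two preprocessing steps described above can be illustrated with a minimal sketch. It assumes that slices are plain nested lists of intensities and that a fixed center crop is an acceptable stand-in for the removal of the black border; the function names `drop_black_slices` and `center_crop` are hypothetical, and the exact cropping procedure used in the experiments is not specified in the text.

```python
def is_black(image):
    """Return True if every pixel of a 2D image (list of rows) is zero."""
    return all(value == 0 for row in image for value in row)


def drop_black_slices(slices, masks):
    """Discard slices that are entirely black, together with their masks."""
    kept = [(s, m) for s, m in zip(slices, masks) if not is_black(s)]
    return [s for s, _ in kept], [m for _, m in kept]


def center_crop(image, size=160):
    """Crop a 2D image to size x size around its center.

    Assumption: the brain is roughly centered, so a fixed crop removes
    mostly background. This is an illustrative choice only.
    """
    height, width = len(image), len(image[0])
    if height < size or width < size:
        raise ValueError("image is smaller than the requested crop")
    top = (height - size) // 2
    left = (width - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]
```

Applied to a 181 × 217 FLAIR slice, `center_crop` discards 10 rows from the top and 28 columns from the left (plus the remaining rows and columns on the opposite sides), yielding the 160 × 160 input size mentioned above.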

Proposed model
In recent years, medical imaging researchers have demonstrated that U-Net [10] and its customized variants provide effective results in various scenarios. The capacity of U-Net to produce detailed segmentation maps from a very limited amount of data makes it particularly helpful in medical imaging, where access to large amounts of labeled data is very limited.
State-of-the-art works employ U-Net in medical image segmentation with customizations that enhance the results; in our case, squeeze-and-attention blocks [14] considerably improve the results on MS lesion segmentation, as reported in the next sections. Before introducing the overall architecture, the main blocks added to our segmentation network are described below.
Squeeze-and-attention module. Squeeze-and-attention (SA) modules [14] attempt to emphasize channels that contain informative features and suppress the non-informative ones. This module performs a re-weighting that pays attention both locally and globally: locally because the convolutional operations are performed in a small pixel neighborhood, and globally because it selects which image feature maps to focus on to perform segmentation. SA extends the feature recalibration operations performed by the squeeze-and-excitation (SE) modules [42] by not applying a fully-squeezed operation to spatial information.
Convolutional LSTM. A convolutional LSTM [15] combines the advantages of RNN and CNN architectures. It introduces convolutional layers in place of fully connected layers in an LSTM to enable more structure in the recurrent levels. In medical image segmentation, spatial information is essential to be able to reconstruct an entire area. For this reason, a convolutional LSTM uses the convolution operator in recurrent connections to learn the spatial features of adjacent images.
Figure 4 shows the proposed MS lesion segmentation architecture. It is built starting from a Fully Convolutional DenseNet (known as the Tiramisú network [11]) based on a modified U-Net structure [10]. As mentioned before, compared to the conventional U-Net architectures for MS lesion segmentation [43], squeeze-and-attention blocks were added to both the downsampling and upsampling paths to emphasize the more informative feature maps; in addition, a unidirectional convolutional LSTM [15] was inserted in the bottleneck, in order to capture the spatial correlation of sequential axial slices. These slices are processed independently in the first part of the network (convolutional operations) and combined in the network bottleneck to produce the final output. In particular, the segmentation result of the central slice is obtained by providing the network with an input composed of that slice and the two adjacent ones (previous and next), since lesions can be observed to propagate across adjacent slices with a similar shape. We chose a sequence of 3 slices because an MS lesion is most likely to be contained within such a short spatial sequence; a longer sequence could include lesions with a different structure and would also increase the training time.
The architecture consists of a downsampling path composed of a convolutional layer and five sequences, each composed of a dense block, a squeeze-attention block and a transition down block. The upsampling path is symmetrical to the downsampling one, and each pair of upsampling/downsampling blocks is concatenated through skip connections.
The proposed network consists of over 400 levels. As it is not possible to include a table listing all the levels, these are described using the grouping in Table 1. As mentioned earlier, the network takes FLAIR images as input, because lesions are more visible in this modality, and generates the lesion segmentation mask as output. Within the architecture each slice is represented in grayscale, so that every pixel value contains only intensity information. The group of three slices is passed through a standard convolutional layer, which is needed to increase the size of the feature maps. The slices then go through the downsampling path, also called the encoder, consisting of a sequence of dense blocks, squeeze-attention blocks, and transition down blocks. In the downsampling process, the spatial resolution of the images is gradually reduced and the number of feature maps is gradually increased. The advantage of applying squeeze-attention modules is that they emphasize channels that contain informative features and suppress the non-informative ones. Specifically, it has been shown that the SA block introduces a pixel-group attention mechanism with a convolutional attention channel, which allows the network to selectively focus on the most significant groups of pixels in the input image while excluding other groups. This is achieved through spatial attention, where neighboring pixels of the same class are grouped together and treated as a single unit during processing, allowing for pixel-wise prediction [14]. In the case of multiple sclerosis lesion segmentation, it is particularly important to consider the relationships between the pixels in a group, as lesions often have a distinct shape and structure. In this path, the output feature maps from each transition down level are concatenated with the output feature maps of each squeeze-and-attention level, and used as the input of the next level.
The downsampling path is followed by the bottleneck, which is typically characterized by a sequence of levels that process the slices at their lowest spatial resolution. It consists of a dense block and a unidirectional convolutional LSTM layer. A 2D convolutional approach was chosen instead of a 3D one, as the dataset employed had a limited number of samples available. By incorporating the LSTM layer, the network can capture the spatial dependencies between adjacent slices, leading to better feature representations and improved accuracy in the final output; the network thus focuses on the sequentiality of the scans instead of receiving an entire scan in a single step. In this way, the sequential task has many more samples than the 3D task. The bottleneck is followed by the upsampling path, also called the decoder, which is symmetrical to the downsampling path and is used to recover the input spatial resolution lost during the previous path. The spatial resolution of the images is gradually increased and the number of feature maps is gradually reduced. The main characteristic of the upsampling path is the presence of skip connections, which concatenate the feature maps at the exit of each Transition Up block with those of the same resolution coming from the downsampling path, to create the input of the next layer. The skip connections are useful to recover spatially detailed information lost during the downsampling path. At the end of the upsampling path, there are a convolutional layer and a softmax, which encodes for each pixel a probability for each possible class. Thus, the output of the model is the segmentation mask of the central slice of the sequence.
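The three-slice input described above can be illustrated with a short sketch. It assumes that a scan is an ordered list of axial slices and that the first and last slices, which lack one neighbor, are simply skipped; how boundary slices are actually handled is not stated in the text, and the function name `slice_triplets` is hypothetical.

```python
def slice_triplets(scan):
    """Group consecutive axial slices into (previous, central, next) triplets.

    Each triplet is one network input; the segmentation target is the mask
    of the central slice. Boundary slices are skipped (an assumption).
    """
    return [
        (scan[i - 1], scan[i], scan[i + 1])
        for i in range(1, len(scan) - 1)
    ]
```

With this choice a scan of N slices yields N - 2 training samples, which is consistent with the remark that a sequential 2D formulation provides many more samples than feeding one whole 3D volume per step.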

Evaluation metrics
The model was evaluated by comparing the predicted segmentation masks with the reference ones which, as mentioned previously, are those produced by only one of the experts and taken as ground truth.
As evaluation metrics, Dice score, sensitivity, specificity, Extra Fraction (EF), Intersection Over Union (IOU), Positive Predictive Value (PPV) and Negative Predictive Value (NPV) were used. The Dice score [44] is defined in Eq. (1), where TP, FP and FN denote the number of True Positive, False Positive and False Negative pixels, respectively. It measures the similarity between two sets (here, the predicted and the reference masks) and is widely used in medical image segmentation. Sensitivity, defined in Eq. (2), measures the proportion of positive pixels that are properly identified. Specificity, defined in Eq. (3), measures the proportion of negative pixels that are properly identified. Extra Fraction, defined in Eq. (4), measures the number of segmented pixels that are not in the reference segmentation. IOU, defined in Eq. (5), quantifies the degree of overlap between two regions. PPV, defined in Eq. (6), measures the proportion of pixels predicted as positive that are true positives. NPV, defined in Eq. (7), measures the proportion of pixels predicted as negative that are true negatives.
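As an illustration, the overlap metrics can be computed from the confusion counts of a binary mask pair. The sketch below uses the standard textbook definitions, which are assumed (not confirmed by the text) to match Eqs. (1)-(3) and (5)-(7); Extra Fraction is left out because its normalization is not stated here. The function names are hypothetical.

```python
def confusion_counts(pred, ref):
    """Count TP, FP, TN, FN between two binary masks (lists of rows of 0/1)."""
    tp = fp = tn = fn = 0
    for pred_row, ref_row in zip(pred, ref):
        for p, r in zip(pred_row, ref_row):
            if p and r:
                tp += 1
            elif p and not r:
                fp += 1
            elif not p and r:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def segmentation_metrics(tp, fp, tn, fn):
    """Standard definitions; a zero denominator yields NaN."""
    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "dice": ratio(2 * tp, 2 * tp + fp + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "iou": ratio(tp, tp + fp + fn),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }
```

Because lesion pixels are rare, TN dominates the counts, so specificity and NPV tend to be close to 1 even for mediocre segmentations; Dice and IOU, which ignore TN, are more informative for this task.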

Experimental setup and hardware specification
As explained in Section 3, the experiments were performed on the images obtained after the preprocessing phase, i.e., after removing the black areas. Given the low number of patients (only 5) and scans (only 21), the tests were carried out through a 5-fold cross-validation strategy, with 17 scans used to train the network, 3 scans used as validation set and 1 scan used for testing. As done by Hashemi et al. [34], part of each patient's scans was employed during the training phase and the remaining ones for testing.
• Fold1 includes a total of 1119 images as a training set, 197 images as a validation set and 70 images as a test set (patient 1 at T4).
• Fold2 includes a total of 1119 images as a training set, 183 images as a validation set and 84 images as a test set (patient 2 at T4).
• Fold3 includes a total of 1119 images as a training set, 200 images as a validation set and 67 images as a test set (patient 3 at T4).
• Fold4 includes a total of 1136 images as a training set, 208 images as a validation set and 42 images as a test set (patient 4 at T4).
• Fold5 includes a total of 1111 images as a training set, 225 images as a validation set and 50 images as a test set (patient 5 at T4).
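The split sizes listed above can be checked with a few lines of arithmetic. The sketch below only re-uses the image counts reported for the five folds; the variable names are illustrative.

```python
# (training, validation, test) image counts as listed for each fold
FOLDS = {
    "Fold1": (1119, 197, 70),
    "Fold2": (1119, 183, 84),
    "Fold3": (1119, 200, 67),
    "Fold4": (1136, 208, 42),
    "Fold5": (1111, 225, 50),
}

# Total number of images used in each fold
totals = {name: sum(counts) for name, counts in FOLDS.items()}

# Fraction of each fold's images reserved for testing
test_share = {name: counts[2] / totals[name] for name, counts in FOLDS.items()}
```

Every fold sums to the same 1386 images, so the five folds are different partitions of one common pool of preprocessed slices, with the test portion ranging from 42 to 84 images.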
The proposed approach was implemented in Python (version 3.9.7) using the PyTorch [45] package. All experiments were run on an NVIDIA Quadro RTX 6000 GPU. The network was trained using the Dice loss function, which considers both local and global information. Training was performed for 200 epochs, well beyond the point at which the network typically converged, using Stochastic Gradient Descent (SGD) [46] as optimizer with an initial learning rate of 1e-4, a weight decay of 1e-4 and a batch size of 4. Figure 5 shows the Dice and loss curves obtained during training for the various folds. For each fold, the model with the highest Dice value on the validation set was selected to perform all tests. The training time for 200 epochs was approximately 20 hours. No data augmentation was applied during the training process. To check for overfitting, a subsequent experiment was performed by applying random transformations to the training data, including flipping and affine transformations of the images. The results obtained from this additional training are very similar to those reported in Figure 5, which suggests that the model converges without overfitting and that its performance is robust.
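For reference, a common form of the Dice loss and the hyperparameters reported above can be written down as follows. This is a plain-Python sketch on flattened probability lists, not the PyTorch code used in the experiments; the smoothing term `eps` and the names `soft_dice_loss` and `TRAIN_CONFIG` are assumptions introduced for illustration.

```python
# Hyperparameters as reported in the text
TRAIN_CONFIG = {
    "optimizer": "SGD",
    "learning_rate": 1e-4,
    "weight_decay": 1e-4,
    "batch_size": 4,
    "epochs": 200,
}


def soft_dice_loss(probs, target, eps=1e-6):
    """Soft Dice loss for one flattened mask.

    probs  : predicted lesion probabilities in [0, 1]
    target : reference labels (0 or 1)
    eps    : smoothing term avoiding division by zero (an assumption)
    """
    intersection = sum(p * t for p, t in zip(probs, target))
    denominator = sum(probs) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)
```

The loss is 0 for a perfect prediction and approaches 1 when prediction and reference do not overlap; since it is normalized by the sizes of the two masks rather than by the image size, it is less affected by the dominance of background pixels than a per-pixel loss.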

General results
To properly evaluate the performance of the proposed approach, a set of tests was conducted in which the employed folds contain multiple combinations of the data. The averages ± Standard Deviation (SD) of all metrics obtained in the cross-validation test folds are reported in Table 2. From the comparison of the results obtained in the different folds, it can be concluded that the model achieves its best result in terms of Dice score in Fold2 (89%). In general, high values are achieved for all metrics, and the results are very comparable across all folds.
The results of the proposed approach on the test set can be appreciated visually in Figure 6, which shows the segmentation results on slices of the same patient extracted from three different regions of the brain. Figure 6 shows the slices of the same subject (a), (e), (i), the ground-truth segmentations (b), (f), (j), the segmentations obtained by the proposed approach (c), (g), (k), and the false-positive and false-negative pixels, marked in red and green respectively, in (d), (h), (l).
As can be observed, the predicted lesion mask is very similar to the ground-truth mask, so the proposed approach segments most of the lesions with good accuracy. The mask of false-positive and false-negative pixels confirms that most of the FLAIR MS lesions were correctly detected by the model.
The proposed approach has been compared with recent state-of-the-art solutions based on 2D/3D U-Net for MS lesion segmentation. The comparison was based on the mean Dice score of each method on the ISBI2015 dataset. Table 3 shows the results of state-of-the-art methods, our results achieved in the best fold (Fold2 in our case) and our mean calculated over all the involved folds.
The state-of-the-art methods used for the sake of comparison are the following: [33,47,48] (3D U-Net architecture) and [32,35,49] (2D U-Net architecture). We also included [50,51,52], based on a CNN model, while [43] makes use of a Tiramisú network, combining slices in the three anatomical planes to capture both global and local contexts. The results of Table 3 show that our framework improves the results of the state of the art by about 7 percentage points of Dice Score. The results obtained by the method proposed in [34] are the most similar to ours; [34] makes use of two segmentation networks for MS lesions, an Attention U-Net and a U-Net, and presents the results obtained with both networks on the different MRI acquisition modalities. The comparison between our results and those of [34] refers to FLAIR images and considers, for both methods, the results on the patient corresponding to the test set of our Fold2. Furthermore, it can be noted that the use of attention mechanisms does not give an advantage in [34], as the Attention U-Net has worse results than the simple U-Net. The results obtained with our method in Fold2 exceed the results obtained by [34] for both networks.

Table 2: Average of the evaluation metrics for the proposed approach in the different folds for the test data. The last row shows the average among all folds (Average ± SD).

Additional test
As previously mentioned, we are in the process of building a new dataset containing FLAIR scans of multiple sclerosis patients with expert-labeled MS lesions.
To evaluate the performance of the proposed method (with the model trained on the ISBI-2015 dataset), an additional test was done employing MR FLAIR images from three patients of our in-progress dataset. As depicted in Figure 7, the ground truths of three patients (P1, P2, and P3) were overlaid with the corresponding segmentations obtained by our model and the resulting masks. The high number of false-negative pixels indicates that lesion contouring is the task where our network is least accurate: the larger the lesions, the less accurate the contouring. Table 4 reports the individual-fold and average Dice Scores for each patient. The Dice Scores obtained by our method on these test scans are somewhat different from those obtained on the training dataset, but the mean Dice Score of 0.7730 achieved for P1 represents a satisfying result in terms of accuracy and lesion segmentation. This discrepancy may be due to differences between acquisition scanners, as a 3.0 T scanner was used for the training images (ISBI-2015), whereas a 1.5 T scanner was employed for the testing images. In addition, the reduced performance on the test scans may reflect the fact that they were obtained from a single time point, whereas the scans of the training set included multiple serial acquisitions for each patient, which improved the accuracy of our automated segmentation method.

Ablation studies
Some ablation studies were carried out to explain the design choices behind the employed architecture. The proposed network architecture consists of a main backbone to which several modules have been added; some variants are therefore presented in this Section. Specifically, tests were carried out by removing parts of the network or replacing them with others, in order to better explain the model behavior and the overall performance achieved. The purpose is to quantitatively measure the contribution of each part to the overall model. Starting from the full model, i.e. the Tiramisú network architecture with squeeze and attention layers in the two paths and the unidirectional convolutional LSTM layer in the bottleneck (shown as FC-DenseNet + SA + C-LSTM in Table 5), three different configurations were considered: the basic Tiramisú model (FC-DenseNet in Table 5), the Tiramisú model with the addition of the unidirectional convolutional LSTM layer in the bottleneck (FC-DenseNet + C-LSTM in Table 5), and the Tiramisú model with the addition of the squeeze and attention modules in the two network paths (FC-DenseNet + SA in Table 5). Every configuration was tested on two of the five folds described in Section 4.2, chosen on the basis of the results obtained in Section 4.3. In particular, the two folds with the best and worst Dice Scores in the test experiments were chosen, Fold2 and Fold4, respectively. As can be verified from Table 5, the ablation studies show that the SA module always improves performance, while the C-LSTM helps only when it is coupled with SA.
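The ablation grid described above can be written down as a small sketch that enumerates the four configurations and the two folds. It only lists the experiments and produces no results; the class `AblationConfig` and the names `CONFIGS`, `FOLDS` and `RUNS` are hypothetical and not part of the original implementation, while the labels follow those used in Table 5.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class AblationConfig:
    """One variant of the FC-DenseNet (Tiramisú) backbone."""
    use_sa: bool      # squeeze and attention modules in the two paths
    use_clstm: bool   # unidirectional convolutional LSTM in the bottleneck

    @property
    def label(self):
        parts = ["FC-DenseNet"]
        if self.use_sa:
            parts.append("SA")
        if self.use_clstm:
            parts.append("C-LSTM")
        return " + ".join(parts)


# All four on/off combinations of the two optional modules.
CONFIGS = [AblationConfig(sa, lstm)
           for sa, lstm in product([False, True], repeat=2)]

# Folds with the best and worst test Dice Scores, as stated in the text.
FOLDS = ["Fold2", "Fold4"]

# Every (configuration, fold) pair to be trained and evaluated.
RUNS = [(cfg.label, fold) for cfg in CONFIGS for fold in FOLDS]

for label, fold in RUNS:
    print(f"{label:28s} {fold}")
```

Encoding each module as a boolean flag makes it explicit that the study covers the complete 2x2 grid, which is what allows the effect of C-LSTM to be read separately with and without SA.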

Conclusion and future work
In this paper we proposed a new framework to address the problem of MS lesion segmentation on MRI, in an effort to facilitate the estimation of disease burden over time. In particular, our approach is based on an extension of the U-Net neural network. The proposed method proved to be more accurate than state-of-the-art methods, boosting results by exploiting a dedicated attention mechanism. It is worth noting that the simple insertion of attention does not always improve results [34], whereas a dedicated solution, such as the novel one presented in this paper, can provide a substantial improvement. The effectiveness and robustness of the technique were demonstrated for the first time on patients never employed for the training of the model. The high Dice Score obtained by the proposed method on this particular sample is of great importance in demonstrating the generalization capabilities of the solution, as it is not dependent on a specific acquisition hardware and method. To further investigate these capabilities, we are continuing the acquisition campaign with the aim of collecting new samples and enriching the comparison dataset. Furthermore, the lesion segmentation framework proposed in this paper uses recent AI methodologies to estimate the level of progression of MS by automatically recognizing the lesions in MRI images. The data obtained on the quantitative MRI lesion load will be used in the UISS framework, which aims to model and simulate the progression of MS lesions as well as to predict the immune response to specific treatments. A potential future direction for this research is to explore the possibility of replacing the recurrent layers with 3D convolutions to enhance the performance of the network.