How far are MS lesion detection and segmentation integrated into the clinical workflow? A systematic review

Highlights

• There is a lack of knowledge about how automated tools for lesion detection/segmentation in multiple sclerosis (MS) perform within a clinical setting and about how they might be integrated into clinical practice.
• The value and economic impact of those tools on patient management are unclear.
• The development of new tools for automated MS lesion detection/segmentation should include their integration into the clinical workflow.


Introduction
Multiple sclerosis (MS) is an inflammatory demyelinating disease of the central nervous system, which affects almost 3 million people worldwide (Walton et al., 2020). MS is the most prevalent neurological disease among young adults, and it is associated with a progressive increase in disability, which can significantly affect an individual's quality of life as well as impose a substantial economic burden on patients, their families, and society at large (Feinstein, 2004). MS mostly manifests as focal inflammatory and degenerative lesions, but also as diffuse brain and spinal cord damage, which ultimately results in permanent brain volume loss (Reich et al., 2018). Hence, assessing the impact of neuroinflammation and neurodegeneration in patients through the identification of adequate imaging biomarkers is fundamental.
MS diagnosis requires the demonstration of dissemination in space (i.e., specific regions of the brain and spinal cord must be affected by areas of focal inflammation/damage, named plaques or lesions) and in time (i.e., assessment of the increase in the lesions' number and volume over time). The information provided by Magnetic Resonance Imaging (MRI) can address both requirements and is, therefore, essential for MS diagnosis (Thompson et al., 2018). Fig. 1 shows the appearance of MS lesions on brain MRI.
MS lesion detection and segmentation is usually performed manually by trained neuroradiologists; it is a time-consuming and error-prone task (Egger et al., 2017). As a consequence, automatic tools supporting this procedure are urgently needed.
To date, several automated approaches have been proposed to support this key task, leading to a plethora of tools (reviewed in Llado et al., 2012; García-Lorenzo et al., 2012; Alrabai et al., 2022; Zeng et al., 2020; Diaz-Hurtado et al., 2022; Commowick et al., 2023) that are more or less mature towards clinical application and use. In the last 15 years, many international challenges, organised in the context of the Medical Image Computing and Computer Assisted Intervention (MICCAI) conference (Styner et al., 2008; Commowick et al., 2018; Kuijf et al., 2019; Commowick et al., 2021) and the International Symposium on Biomedical Imaging (ISBI) (Carass et al., 2017), provided benchmark datasets to promote a fair evaluation. In addition, the Shifts Challenge (Malinin et al., 2022) focused on the estimation of the robustness and uncertainty of such methods.
To facilitate the adoption of automated image analysis tools in the practice of clinical neuroradiology, Goodkin et al. (2019) proposed a framework based on a sequence of six steps, named the Quantitative Neuroradiology Initiative (QNI). The six steps can be summarised as providing: 1. the target clinical area and biomarkers; 2. the structure of the automated method; 3. a quantitative report; 4. a technical and clinical validation; 5. details about the integration into the clinical workflow; 6. an in-use evaluation.
Although these requirements were originally applied to the radiological assessment of dementia, the framework was later adopted to conduct systematic reviews on commercial volumetric MRI reporting tools in dementia (Pemberton et al., 2021) and MS (Mendelsohn et al., 2022). While in the present work we aim to present the state of the art of the scientific literature, the mentioned reviews strictly focused on studies related to commercial devices.
Table 1 describes the requirements to fulfill the six QNI steps. The first and second steps include the identification of the target clinical area, the associated imaging biomarkers (lesional in the case of MS), the automated model's structure, and reference datasets. A third phase consists of compiling a visually informative quantitative report, to be integrated into the radiology report. The fourth step relates to the technical and pre-use clinical validation, which encapsulates a "credibility" and an "accuracy" study. The former suggests a data quality check and a review of the technical performance of the method. The latter refers to a blinded rating of a limited number of cases and an assessment of the clinical reporting process: radiologists' accuracy and reporting efficiency should be examined, with and without the automated tool, in the closest possible environment to the usual radiology setting. The fifth step is the integration of tools into the clinical workflow, from data format compatibility to data protection and the joint visualisation of Digital Imaging and Communications in Medicine (DICOM) series and model output. The final phase describes an in-use pipeline evaluation concerning patient management and the socio-economic impact of the tool. Key concepts are the smoothness of the tool's integration into a hospital's radiology department, the speed of diagnosis, the cost in resources, productivity, general perception, and the mid-term economic impact. Based on these criteria, the present review analyses to what extent the current literature on automatic tools for the detection and segmentation of MS lesions follows the QNI steps and, thus, considers integration into the clinical routine.

Material and methods
In this review, we adopt the methodology described in the "Cochrane Handbook for Systematic Reviews of Interventions" (Lefebvre et al., 2022) to collect articles published up to June 2023. To broaden and differentiate the screening pool, we targeted two databases, respectively medicine- and engineering-oriented: PubMed (https://pubmed.ncbi.nlm.nih.gov/) and IEEE (https://ieeexplore.ieee.org/Xplore/home.jsp). We adapted Cochrane's threefold subdivision of screening keywords to our case, from "population, intervention and study design" to "population, task, and design of the tool". While population refers to clinically confirmed MS patients, task and design describe what we expect as the automatic model's primary output and general characteristics. The three keyword groups were: 1. "multiple sclerosis"; 2. "segment*" OR "detect*"; 3. "machine learning" OR "deep learning" OR "automat*" OR "digital tool". Note that a word including a * is a "wildcard" covering all suffixes of a word stem; for example, "automat*" stands for both "automated" and "automatic". By searching within both databases using these keyword groups, 770 studies were extracted (123 from IEEE and 647 from PubMed).
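As an illustration only (the review's screening was not performed this way), the sketch below shows how such a boolean query could be composed programmatically and submitted to PubMed through the NCBI E-utilities API; the exact query syntax accepted by each database's own interface may differ.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# The three keyword groups used in the review; "*" is a wildcard that
# expands a word stem (e.g. "automat*" matches "automated"/"automatic").
population = '"multiple sclerosis"'
task = "(segment* OR detect*)"
design = '("machine learning" OR "deep learning" OR automat* OR "digital tool")'
query = f"{population} AND {task} AND {design}"

# NCBI E-utilities search endpoint for PubMed.
params = urlencode({"db": "pubmed", "term": query, "retmax": 100})
url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?{params}"

with urlopen(url) as response:
    print(response.read().decode()[:500])  # XML listing matching PubMed IDs
```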
These search criteria were applied to all metadata fields, including title, abstract, and keywords.

Study inclusion criteria
Screened articles were included in the review when they met all the following inclusion criteria: 1. original research published after 2011 in academic peer-reviewed journals or conferences in the English language; 2. studies targeting the fully automatic detection or segmentation of non-enhanced (i.e., without contrast agents) white matter lesions, as either a primary or a secondary objective; 3. studies targeting brain MRI modalities; 4. studies targeting a clinical MS population, alone or mixed with patients with a clinically isolated syndrome (CIS, a first symptomatic episode of potential MS); 5. studies performing either a technical, clinical, or in-use validation.
As a consequence, papers including a wider population than MS (in separate datasets), performing longitudinal or cross-sectional evaluations, or presenting a different primary goal or other lesion types have been included in this review only if they also met the above conditions. For each step of the QNI framework, the methodology of the reviewed articles was discussed and evaluated as compliant or not compliant. It must be noted that failure to comply with some steps towards the clinical use of those methods does not imply any superficiality in the methodology applied; it indicates, instead, that an article focuses primarily on other objectives.
In our work, we distinguish among technical, clinical and in-use assessment as follows:
• Technical validation: comparing results to manual segmentation and/or state-of-the-art segmentation software, and performing data quality checks.
• Clinical validation: any evaluation of the tool's impact on clinical management, diagnostics, and reliability with respect to the reference annotated "ground truth".
• In-use evaluation: any study measuring how easily the tool can be integrated into the reporting workflow, the benefits for patients, the general perception, and the socio-economic effects of the tool.
After merging the results from the two databases, 22 records were excluded as duplicates, leaving 748 studies for further review. Upon examination of the abstracts, 562 records were excluded as not compliant with the inclusion criteria. After carefully reviewing the full texts of the remaining 208 studies, 52 articles were further excluded due to their objective, population (e.g., dementia), input type (e.g., synthetic data), language, availability, or method (only fully automatic methods were considered). The PRISMA flow diagram (Page et al., 2021) describing the procedure used to select the 156 studies included in the review is reported in Fig. 2.
The search strategy was peer-reviewed by S.S., an experienced information specialist within our team. All data used in the review are available and can be accessed through PubMed and IEEE databases.

Results
Following the described methodology, 156 studies meeting all the inclusion criteria were identified (Fig. 2; see the list of abbreviations in Table 2 and the first columns of Tables 3-5).

Target population
In ten articles, MS patients were mixed with subjects presenting CIS (Salem et al., 2020; Jannat et al., 2021; Salem et al., 2019; Valencia et al., 2022; Salem et al., 2017; Dwyer et al., 2019; Sitter et al., 2017; Cabezas et al., 2016), or neuromyelitis optica spectrum disorders and cerebral small vessel disease, or mild cognitive impairment, Alzheimer's disease, Parkinson's disease and frontotemporal dementia. The remaining studies targeted at least one dataset with only MS patients (see second columns of Tables 3-5).

Magnetic resonance imaging
Fluid attenuated inversion recovery (FLAIR) was the most common MRI contrast used as input for the proposed automatic methods. It was used alone or in combination with a T1-weighted (T1-w), a T2-weighted (T2-w), or a proton density weighted (PD-w) image, or with contrast enhancement (see third columns of Tables 3-5). In six cases, the only input provided to the network was either T2-w images (Abhale et al., 2022; Yildirim and Dandil, 2021a), MPRAGE (Magnetisation-prepared rapid gradient echo) (Galimzianova et al., 2015; Spies et al., 2013), MP2RAGE (Magnetisation-prepared 2 rapid gradient echo) (Fartaria et al., 2019) or MR fingerprinting EPI (Echo-planar imaging) (Hermann et al., 2021). Less common contrasts, such as diffusion basis spectrum imaging (Ye et al., 2020), DIR (Fartaria et al., 2015; Schläger et al., 2022; Bouman et al., 2023) and PSIR (Bouman et al., 2023), were also adopted.

Fig. 2. PRISMA flowchart applied during the screening process. The terms "objective", "population", "input type", "language" refer to the inclusion criteria of Section 2.1. The term "access" refers to an exclusion due to the impossibility of accessing the full text of a paper. The term "method" refers to "not fully automatic methods".

Datasets
The methods developed in 92 studies were (at least partially) based on datasets from international challenges: MICCAI 2008 (Styner et al., 2008), MICCAI 2016 (MS-SEG) (Commowick et al., 2018), MS-SEG2 and ISBI 2015 (Carass et al., 2017). Earlier works focused on relatively small cohorts due to the limited sample size provided in the challenges, such as the 5 patients of the ISBI 2015 training set in Vang et al. (2020) and the 20 patients of the MICCAI 2008 training set in Joshi and Sharma (2022).
A single case (Tripoliti et al., 2019) did not provide any reference dataset. The authors proposed the architecture of a tool for the estimation of MS progression, announcing a future proof of concept study with 30 patients for its validation. Since the target area was clearly determined, the first QNI step was considered satisfied.
The remaining 83 studies were based on data from large clinical trials, University hospitals or publicly available sources (see second columns of Tables 3-5).
Automatic methods

The high-level category of deep neural networks was predominant, with convolutional neural networks (CNNs) such as U-Nets being the most represented (see fourth columns of Tables 3-5). Basaran et al. (2022) adopted nnU-Net (Isensee et al., 2021), a method that automatically configures pre-processing steps, architecture, training and post-processing to better adapt to dataset properties and available hardware.
In Tripoliti et al. (2019), no details were disclosed about the automatic method and, as mentioned in Section 3.3, the reference dataset was not described. As a consequence, this conference paper did not fulfill the second QNI step.
Longitudinal methods (i.e., assessing changes in lesions' number and volume across two or more time points) adopt different approaches compared to cross-sectional methods (i.e., those using images acquired at a single time point). In fact, the evaluation of follow-up scans presents specific challenges, such as image registration when patient positioning is not consistent, and the pre-processing steps required to account for variations in image acquisition between scans. Moreover, new lesions in follow-up scans are usually small, and there is currently no threshold defining a significant lesion enlargement. Different approaches have been proposed to overcome these challenges, such as that of Salem et al. (2022), who used a cascade of two FCNNs to refine possible misclassifications, or that of Sepahvand et al. (2020), where an attention mechanism based on image subtraction between two time points was applied to help a U-Net differentiate between anatomical and artifactual change.
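As a minimal sketch of the subtraction-based attention idea (not Sepahvand et al.'s actual implementation), the snippet below stacks two co-registered, intensity-normalised volumes with their normalised difference map, yielding a multi-channel input in which regions of change are made explicit for a longitudinal network:

```python
import numpy as np

def subtraction_attention_input(baseline: np.ndarray, followup: np.ndarray) -> np.ndarray:
    """Stack baseline, follow-up and their difference map as input channels.

    Assumes the two volumes are already co-registered and intensity-normalised.
    """
    diff = followup - baseline
    # Rescale the difference map to [0, 1] so it acts as a soft attention
    # channel highlighting candidate new or enlarging lesions.
    attention = (diff - diff.min()) / (diff.max() - diff.min() + 1e-8)
    return np.stack([baseline, followup, attention], axis=0)  # (3, D, H, W)

# Toy example with random volumes standing in for registered FLAIR scans.
rng = np.random.default_rng(0)
channels = subtraction_attention_input(rng.random((32, 32, 32)), rng.random((32, 32, 32)))
print(channels.shape)  # (3, 32, 32, 32)
```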

Data quality check and pre-processing
Data quality checks, where mentioned, consisted of the removal of null slices (Ghosal et al., 2020; Kumar et al., 2019; Alijamaat et al., 2021; Rondinella et al., 2023), control of the scanning protocol, and thorough visual inspection (Schmidt et al., 2019). In Cavedo et al. (2022), a quality check of the MRI parameters was performed before the MRI analysis, to verify that the parameters aligned with those recommended. An image quality assessment was also explored in Valencia et al. (2022), through the median absolute error and the structural similarity index. Other metrics, such as lesion conspicuity, SNR (signal to noise ratio), contrast to noise ratio, and the variance of the Laplacian, were selected in Arnold et al. (2022). Narayana et al. (2018) used the automated pipeline validated in Narayana et al. (2013) to check the headers and SNR of DICOM images.
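The exact formulas behind such checks are not always disclosed; as a hedged sketch, two of the cited measures, SNR and the variance of the Laplacian, could be computed on a volume as follows (the mask definitions are illustrative assumptions):

```python
import numpy as np
from scipy.ndimage import laplace

def snr(image: np.ndarray, foreground: np.ndarray, background: np.ndarray) -> float:
    """SNR estimate: mean signal in a tissue mask over the noise (std) in an
    air/background mask. Both masks are boolean arrays of the image's shape."""
    return float(image[foreground].mean() / (image[background].std() + 1e-8))

def variance_of_laplacian(image: np.ndarray) -> float:
    """Sharpness proxy: blurred or motion-degraded scans yield low values."""
    return float(laplace(image.astype(np.float64)).var())
```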
A more careful approach was developed by Rakic et al. (2021), who dealt with the T1-w and FLAIR modalities of 159 MS patients from multiple centres and scanners. In order to preserve robustness and minimise data bias, the authors followed a carefully designed protocol: the training, validation, and test sets were stratified so as to equally represent all data characteristics, such as screening site, scanner model, magnetic field strength, scan quality, and slice thickness.
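A stratification of this kind could be sketched by combining the characteristics into a composite key; the metadata below are hypothetical and the snippet is not the authors' actual protocol:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical per-scan metadata; site and field strength stand in for the
# full set of characteristics (scanner model, scan quality, slice thickness).
meta = pd.DataFrame({
    "subject": range(8),
    "site": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "field_strength": ["1.5T", "1.5T", "3T", "3T", "1.5T", "1.5T", "3T", "3T"],
})

# Stratify on a composite key so every site/field-strength combination is
# represented in both splits.
meta["stratum"] = meta["site"] + "_" + meta["field_strength"]
train, test = train_test_split(meta, test_size=0.5, stratify=meta["stratum"], random_state=0)
print(sorted(train["stratum"]), sorted(test["stratum"]), sep="\n")
```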
In Todea et al. (2023), two experts performed an image quality assessment (SNR, artifacts, contrast, good registration between time points), and the longitudinal analysis was evaluated both on the whole dataset and on images with the same quality score. The same comparison was applied to images obtained with 1.5T and 3T scanners. On the other hand, in Combès et al. (2021), data with lower quality were intentionally not excluded from the study, to mimic a real-world scenario.
Most studies included the following data pre-processing steps: bias field inhomogeneity correction, intensity normalisation, skull stripping, denoising, resampling, and, in the case of multiple input modalities, co-registration.

Table 3 Studies' information containing details on datasets, inputs, and architecture of the automatic algorithm, pre-processing steps, and evaluation metrics (part 1).
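As a hedged sketch of the first two pre-processing steps listed above (bias field correction and intensity normalisation), the snippet below uses SimpleITK, which is one possible library choice rather than the tool used by any particular reviewed study:

```python
import SimpleITK as sitk

def basic_preprocess(image_path: str) -> sitk.Image:
    """Apply N4 bias-field correction followed by z-score normalisation."""
    image = sitk.ReadImage(image_path, sitk.sitkFloat32)
    corrector = sitk.N4BiasFieldCorrectionImageFilter()
    image = corrector.Execute(image)  # bias field inhomogeneity correction
    return sitk.Normalize(image)      # zero mean, unit variance intensities

# Skull stripping, denoising, resampling, and co-registration are typically
# delegated to dedicated tools (e.g., FSL, ANTs) and are omitted here.
# corrected = basic_preprocess("flair.nii.gz")
```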

Quantitative reports
The results presented in 148 studies did not provide radiologists with a summary report. In Cavedo et al. (2022), the authors presented a report with detection scores and the overlay of predictions on the original images, while Yildirim and Dandil (2021a) generated similar documentation in a web-based user interface tested by two radiologists.
In Bilello et al. (2013), the generated report contained new (or enlarging) and resolved (or improving) lesions detected, their specific location and the cerebral hemisphere involved.

Technical validation
The most commonly explored technical evaluation metrics were those required to participate in the international contests (Maier-Hein et al., 2022): 1. overlap-based metrics, such as the Dice similarity coefficient (DSC), sensitivity (recall), specificity, precision, accuracy, the lesion-wise true positive rate (TPR) and false positive rate (FPR), and the absolute volume difference between ground truth and predicted segmentation; 2. surface-based metrics, such as the average symmetric surface distance.
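As a minimal sketch (challenge definitions differ in details such as minimum overlap criteria), two of these metrics could be computed on binary masks as follows:

```python
import numpy as np
from scipy.ndimage import label

def dice(gt: np.ndarray, pred: np.ndarray) -> float:
    """Dice similarity coefficient between two binary masks."""
    intersection = np.logical_and(gt, pred).sum()
    return float(2.0 * intersection / (gt.sum() + pred.sum() + 1e-8))

def lesionwise_tpr(gt: np.ndarray, pred: np.ndarray) -> float:
    """Fraction of ground-truth lesions (connected components) touched by
    the prediction; the simplest convention for the lesion-wise TPR."""
    components, n_lesions = label(gt)
    if n_lesions == 0:
        return float("nan")
    detected = sum(pred[components == i].any() for i in range(1, n_lesions + 1))
    return detected / n_lesions
```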
The lesion annotation through consensus was improved in the latest challenges: the available ground truth (GT) masks are more reliable in terms of inter-observer variability, providing higher-quality GT to train and evaluate the models. An exhaustive list of the adopted metrics is reported in the sixth columns of Tables 3-5.
The latest reviews (Diaz-Hurtado et al., 2022; Commowick et al., 2023) report satisfactory performances, already close to those of human raters, for many automatic detection/segmentation methods. However, as also mentioned in Commowick et al. (2023), there are currently few data on the integration and use of those methods in clinical routine, especially in relation to the quantification of the uncertainty of their predictions in clinical practice.

Clinical validation

Combès et al. (2021) proposed a pre-use validation of their tool involving clinicians. The authors assessed the impact of the segmentation tool on experts' performance as follows: three experts were asked to annotate a point near each lesion's centre (for 48 patients), with and without the help of the automatic tool (referred to as phases one and two). The number of marked lesions and the time spent during the procedure were recorded in both cases. All experts were asked to conduct this experiment in conditions similar to clinical practice; in particular, they were explicitly instructed to spend a reading time comparable to that of clinical routine. A few days prior to the first phase, each expert followed a short training session to get acquainted with the tool.
This experiment was evaluated through several metrics, compared between the two phases, such as the number of detected lesions (by each rater and overall), the average patient-wise number of lesions detected by the experts (compared between phases using a paired t-test), and the pooled inter-expert standard deviation associated with the number of detected lesions.
In addition, the impact on routine clinical practice was assessed on six patients, with and without the tool (the two phases were two weeks apart): the experts measured the time needed from loading and reading the MRI in the hospital Picture Archiving and Communication System (PACS) to generating a radiology report. Patients were categorised in the report as showing "no activity", "1 lesion" or ">1 lesion" with respect to baseline. The time spent performing the radiological readings by each of the three experts in each of the two settings was summarised, and the mean times elapsed in the two settings were tested for equality using a paired t-test.
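In code, such a comparison of per-patient reading times reduces to a paired t-test; the numbers below are hypothetical and serve only to illustrate the analysis reported by the authors:

```python
from scipy.stats import ttest_rel

# Hypothetical reading times (minutes) per patient for one expert,
# without and with the automatic tool.
time_without = [12.5, 9.0, 14.2, 11.1, 10.3, 13.8]
time_with = [8.1, 7.4, 10.0, 9.2, 8.8, 9.5]

stat, p_value = ttest_rel(time_without, time_with)
print(f"t = {stat:.2f}, p = {p_value:.4f}")
```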
A post-experiment interview was conducted to ask experts whether they were satisfied with the tool's level of information and performance.
In Van Hecke et al. (2021), lesion segmentations were compared with the assessments of two raters, one experienced radiologist and one assistant neurologist. The experiment consisted of marking and counting MS lesions on images from 10 patients. The two raters independently assessed all images, which were shuffled and presented first as original scans, then with automatic lesion annotations. The reporting time was recorded, and the agreement between the counts reported by the two raters, with and without the tool, was analysed. Moreover, a similar procedure was followed to test whether the automatic reports might change radiological findings when assessing follow-up scans.
In Bilello et al. (2013), two neuroradiologists generated a clinical report without assistance from the CAD software. Independently, the same scans were assessed by another neuroradiologist using only the software output. In both cases, the detected new, enlarging, resolved and improving lesions were compared, as well as their specified location. The duration of the software-assisted pipeline was also recorded for each scan, not including the image processing time.
Yildirim and Dandil (2021a) reported having their pipeline tested by two radiologists and evaluated as an auxiliary tool for diagnosis and decision support in terms of ease of use, practicality, working speed, and automatic detection. Since no details on the modality of these tests were disclosed in the article, the fourth QNI step cannot be considered fulfilled.
Similarly, Hindsholm et al. (2021) only presented a qualitative assessment of output masks by radiologists. Hence, their clinical validation does not comply with the QNI framework.
The technologists involved in the study by Thakur et al. (2022) reported the time needed for the manual intervention required to execute the tool and the time to assess and generate a report for a single patient. However, they used these findings to compare two versions of the same software instead of evaluating the advantages with respect to a manual assessment. For this reason, this article did not fulfill the fourth QNI step.

Table 4 Studies' information containing details on datasets, inputs, and architecture of the automatic algorithm, pre-processing steps, and evaluation metrics (part 2).

Integration into clinical workflow
In Bilello et al. (2013), the DICOM series of all the paired examinations were available in the PACS, to be exported and used as inputs to the automated method. Similarly, in Tripoliti et al. (2019), the user can retrieve imaging data either from the PACS or from the local disk of the computer where the automatic software is installed.
In Combès et al. (2021), once stored in the local clinical PACS, MR images were pseudonymised and securely transferred to a certified health data hosting provider, where new lesions were automatically segmented. The processed images and the corresponding segmentation maps were then transferred back to the PACS and could be visualised in a dedicated web MRI viewer (using the DICOM format).
Van Hecke et al. (2021) developed a platform including a web portal for healthcare professionals, volumetric brain reports, and the integration with hospitals' PACS and electronic medical record systems.
In Thakur et al. (2022), the automated software had been integrated and routinely used in clinical practice since April 2012. The images were stored in the PACS and converted from DICOM to NIfTI (Neuroimaging Informatics Technology Initiative) for processing. The authors mentioned that their method requires MRI scans to be acquired at the same institution.
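Thakur et al. do not specify the conversion tool they used; a minimal sketch of such a DICOM-to-NIfTI step, using SimpleITK as an assumed stand-in, could look as follows:

```python
import SimpleITK as sitk

def dicom_series_to_nifti(dicom_dir: str, output_path: str) -> None:
    """Read a DICOM series from a directory and write it as a NIfTI volume."""
    reader = sitk.ImageSeriesReader()
    reader.SetFileNames(reader.GetGDCMSeriesFileNames(dicom_dir))
    image = reader.Execute()
    sitk.WriteImage(image, output_path)

# dicom_series_to_nifti("/path/to/dicom_series", "scan.nii.gz")
```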
The integration of the tool into the clinical workflow was only partially investigated in Yildirim and Dandil (2021a), covering data compatibility and the visualisation of the segmented lesions overlaid on the input image. Yet, the integration of their web-based system with a hospital electronic information system, such as the PACS, was not considered. Thus, the fifth QNI step was not satisfied.

In-use validation
Van Hecke et al. (2021) presented and tested a care management system, including a patient mobile phone application (available on Android and iOS) and a website. A first survey was conducted to understand patients' attitudes towards the app, the different possible features, and their level of interest in using such an application. A second survey collected information such as patients' propensity to view MRI images on their own, or whether they would be interested in knowing if there were any changes in follow-ups (such as new lesions or brain volume loss).

QNI steps fulfillment
Based on the findings presented in the 156 studies, 146 comply with the first QNI step and 155 fulfill the second. The third step is considered by eight works, three studies fully investigate the fourth, and five the fifth. Only a single article explores the last QNI step. An overview of the fulfillment of the QNI steps in the screened literature is presented in the road map of Fig. 4a. A similar road map can be generated from the data related to the 10 commercial devices screened by Mendelsohn et al. (2022), reported in Fig. 4b. A summary of the fulfilled steps is reported in the last columns of Tables 3-5.

Discussion
The present systematic review exposes a considerable gap between methods' development and the introduction of those methods into clinical practice. There are many possible causes for this gap.
A first explanation could be the difficulty of implementing clinical trials: complying with clinical regulations and addressing ethical issues might result in an undesirable delay of the investigation. Participants' insufficient knowledge about trial methods and the complexity of study protocols might also jeopardise the patient recruitment process. The lack of trained medical personnel could also represent a problem, both when designing a clinical trial and in the case of an internal clinical validation. All the above reasons are not specific to MS, meaning they could apply to many other neurological and non-neurological disorders.
Clinical integration presents some significant hurdles as well. To be applied in clinical practice, lesion segmentation methods should not only be integrated into the clinical workflow (i.e., be integrated into clinical PACS systems; be readily applicable to MR data that have not been pre-processed and are sometimes acquired on different scanners, or with different image quality despite a consistent acquisition protocol, etc.) but also provide means to evaluate the uncertainty and errors of their outcomes. Ad-hoc integration designs need to be developed that consider the current clinical neuroradiological workflow, evaluate the reliability of those methods in a clinical routine setting, and address clinicians' trust in using them as clinical decision support tools.
To help cover these aspects, an automatic tool could be conceived within a quality management framework for medical devices; the handling of possible failures, risk monitoring, and data storage would also be addressed by following such guidelines. Data storage, management and sharing systems, such as KHEOPS (https://kheops.online/) or Flywheel (https://flywheel.io/), could be a way to deal with PACS and to anonymise imaging data acquired at hospitals. Moreover, the use of a Docker container to execute the software in an isolated and reproducible environment could help towards clinical integration. As to the real advantages of using automated methods in clinical routine, these should be carefully evaluated on site by providing means to assess errors and, where possible, correct them for future evaluations, as could be done, for example, with uncertainty estimation/explainable AI and user-friendly interactive interfaces.
Along with this, the trade-off between the economic costs of a clinical implementation and MS incidence may play an important role. In this sense, addressing the medium- to long-term effects of the tool (the last QNI step) would be helpful. Studies should provide documentation such as: 1. periodical reports on how easily the tool could be integrated, together with feedback from users; 2. the speed of diagnosis and the failure rate, compared to pre-use cases; 3. the amount of required resources, productivity, patient perception, and economic impact.
On the other hand, if a tool is not clinically adopted, its efficacy and perception could be part of the reasons. An extremely wide range of solutions, with respect to the methods' characteristics, inputs, and processing steps, is already available and discussed in reviews (Llado et al., 2012; García-Lorenzo et al., 2012; Alrabai et al., 2022; Zeng et al., 2020; Diaz-Hurtado et al., 2022; Commowick et al., 2023). What is actually lacking is a validation that demonstrates the advantages of automatic methods with respect to the standard procedure. Furthermore, many of the reviewed studies were performed on data from international challenges, which were to some extent curated and, thus, do not reflect current "real-world" clinical scenarios. Feedback from radiologists and neurologists on clinical data could help explore and mitigate potential implementation biases (Vokinger et al., 2021; Varoquaux and Cheplygina, 2022). At the same time, this could change the way the tool is perceived in the clinical environment.

Table 5 Studies' information containing details on datasets, inputs, and architecture of the automatic algorithm, pre-processing steps, and evaluation metrics (part 3). Excerpt of the rows recoverable at this point of the text:

Data | Inputs | Method | Pre-processing | Evaluation metrics | QNI steps fulfilled
(Karpate et al., 2015) 16 MS and 20 HC | MPRAGE, T2-w, FLAIR | least squares probabilistic classification | b.c., denoising | precision, recall | 1st, 2nd
(Mei et al., 2017) 10 MS | FLAIR, T1-w (also with gadolinium) | self-organising maps (neural network) | / | topographic and quantisation errors | 1st, 2nd
[citation missing] 69 MS | T1-w and FLAIR | generative adversarial network | b.c. on T1-w | DSC, recall, precision, F1 | 1st, 2nd
(Dachraoui et al., 2020) 30 MS | T1-w (also gadolinium), T2-w and FLAIR | [remainder of row truncated in source]
An additional reason may be that the latest methods struggle to adapt to the heterogeneity of data acquired in clinical settings. Some recent works attempted to address the challenge of using images acquired with different contrast mechanisms, on scanners produced by different vendors, and with different field strengths (Billot et al., 2021). The issue represented by the different spatial resolutions of clinical images, leading to variable partial volume effects during resampling, still requires ad hoc solutions and additional validation with on-site data. Also, an ad hoc integration of a method into a single institutional PACS may not generalise well in the case of a multicentric study.
Another possible reason for the existing gap between the development and the clinical integration of methods could be the lack of national and international initiatives to promote their translation into clinical practice. In the current situation, there is still a pronounced imbalance in favour of challenges supporting technical evaluations. Similar initiatives related to clinical validation and integration would certainly represent a boost for the implementation of solutions for MS lesion segmentation. Research focused on the integration of those methods into the clinical workflow, as well as on the evaluation of their performance in a clinical routine setting, might substantially help promote their adoption and use by both neuroradiologists and neurologists.
Moreover, reducing the gap between the methods' development and their clinical translation might also be highly beneficial to improve the robustness and minimise the implementation bias of software solutions for MS lesion detection/segmentation. Ultimately, patients would also benefit from a more efficient and trustworthy process supporting disease diagnosis and the monitoring of treatment effects.

Conclusions
We systematically reviewed automatic MS lesion detection and segmentation tools to assess their maturity for clinical integration. Using the six steps of the QNI framework, we examined these quantitative tools' development, validation, and level of integration into the clinical workflow. In this review, we focused on the development required for the clinical application of MS lesion segmentation methods and showed that, to date, there is no consistent evidence of the tools' integration into the clinical workflow. Our work demonstrates, therefore, that there is an important gap that needs to be filled by future research in this field. In addition, the socio-economic effects of those tools and their impact on patient management have yet to be studied.

Data availability
Data will be made available on request.