Where is VALDO? VAscular Lesions Detection and segmentatiOn challenge at MICCAI 2021

Imaging markers of cerebral small vessel disease provide valuable information on brain health, but their manual assessment is time-consuming and hampered by substantial intra- and interrater variability. Automated rating may benefit biomedical research, as well as clinical assessment, but diagnostic reliability of existing algorithms is unknown. Here, we present the results of the VAscular Lesions DetectiOn and Segmentation (Where is VALDO?) challenge that was run as a satellite event at the international conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) 2021. This challenge aimed to promote the development of methods for automated detection and segmentation of small and sparse imaging markers of cerebral small vessel disease, namely enlarged perivascular spaces (EPVS) (Task 1), cerebral microbleeds (Task 2) and lacunes of presumed vascular origin (Task 3) while leveraging weak and noisy labels. Overall, 12 teams participated in the challenge proposing solutions for one or more tasks (4 for Task 1 - EPVS, 9 for Task 2 - Microbleeds and 6 for Task 3 - Lacunes). Multi-cohort data were used in both training and evaluation. Results showed a large variability in performance both across teams and across tasks, with promising results notably for Task 1 - EPVS and Task 2 - Microbleeds and not practically useful results yet for Task 3 - Lacunes. They also highlighted the performance inconsistency across cases that may deter use at an individual level, while still proving useful at a population level.


Introduction
Cerebral small vessel disease (CSVD), the deterioration of the smallest brain vessels, encompasses a large variety of etiologies including arteriolosclerosis (Alistair, 2002) and amyloid pathology (Kester et al., 2014) and may be further driven by genetic predisposition (Haffner et al., 2016; Giau et al., 2019). It results in observable damage or changes to the brain. The most commonly observed MRI markers of CSVD include white matter hyperintensities (WMH), cerebral microbleeds, lacunes of presumed vascular origin, and enlarged perivascular spaces (Wardlaw et al., 2013). CSVD-related damage has been associated with an increased risk of stroke and dementia, and with the acceleration of cognitive decline (Østergaard et al., 2016; Rensma et al., 2018). These markers are also associated with one another (Zhang et al., 2014; Zhu et al., 2010; Yates et al., 2014).
WMH are the most visible marker of CSVD and have naturally taken the centre stage of clinical research in CSVD. In addition, research on development of WMH segmentation solutions has been particularly popularized thanks to impactful research showing the clinical importance of lesion volumetry (Van Straaten et al., 2006). While the automated quantification of white matter hyperintensities has been heavily studied for the last decade with very successful solutions (Sudre et al., 2015; Guerrero et al., 2018; Atlason et al., 2019; De Boer et al., 2009), automated detection and segmentation of the small, focal markers of CSVD has been investigated less frequently. However, as the interest of the clinical community in these markers starts to grow, getting to understand their relevance in clinical research requires them to be adequately detected and quantified. While these markers are currently typically assessed visually through binary dichotomization (presence vs absence) (Yates et al., 2014), counts (Adams et al., 2015), or visual scales (Potter, 2011), such visual assessment is time consuming and subject to large inter- and intra-rater variability (Sudre et al., 2019). Automated methods are therefore required to make quantification robust and reliable as well as feasible in the context of large data sets. So far, development of automated methods has been impeded by the methodological issues related to the very small size of these markers and the resulting extreme imbalance in the data, as well as the absence of a gold standard for annotation.
Methodological developments towards automated solutions for the quantification of biomarkers have found a new dynamic thanks to the annotated datasets made available through technical challenges on segmentation and detection in brain MRI, notably the popular BRATS challenge (Menze et al., 2014), ISLES (Maier et al., 2017), MRBrainS (Mendrik et al., 2015), the 2017 MICCAI WMH challenge (Kuijf et al., 2019) or the more recent ADAM challenge (Timmins et al., 2021) on intracranial aneurysms. Such challenges give insight into state-of-the-art methodology and remaining technical problems for a specific question.
The VAscular Lesions DetectiOn and Segmentation (Where is VALDO?) challenge was organized with the aim of promoting the development of new solutions for the automated detection and segmentation of these sparse and small structural brain changes (enlarged perivascular spaces (Task 1), cerebral microbleeds (Task 2) and lacunes (Task 3)) while leveraging weak and noisy labels from manual annotation or visual assessment. Beyond a simple benchmarking exercise assessing the state of the solution space, this challenge was further intended to gain insight on the current pitfalls and challenges, raise awareness and contribute to the building of a community dedicated to developing solutions to facilitate quantification of CSVD markers in brain MRI scans. This paper describes the design, results, and lessons learnt through the challenge according to the reporting guidelines detailed in (Maier-Hein et al., 2020).
Figure 1: Annotated example of the three types of markers targeted in the challenge (Task 1 - EPVS, Task 2 - Microbleeds, Task 3 - Lacunes).

Mission of the challenge
The Where is VALDO? challenge was organized to assess three tasks, each of them focusing on one focal marker of CSVD: Task 1 on enlarged perivascular spaces (EPVS), Task 2 on cerebral microbleeds and Task 3 on lacunes. Figure 1 illustrates each of these markers as annotated in the challenge training set.
Currently, the lack of accurate and reproducible automated methods for all three markers prohibits the identification of clinically relevant characteristics at both individual and population levels.
Therefore, for each of the stated markers both detection and segmentation performance need to be assessed. Ultimately, the improved quantification of these small focal markers of CSVD may be used to better understand their relevance and derive biomarkers for diagnosis or prognosis in the context of healthy ageing and dementia, and as surrogate end points in clinical trials.
In proposing tasks particularly subject to high data imbalance and limited and/or noisy annotations, this challenge further aimed to catalyse methodological research to address these common issues in the medical image analysis community.
Ultimately, the proposed methods should be applicable to different settings involving ageing populations such as population cohorts, clinical trials or memory clinics.
The challenge dataset, however, consisted exclusively of population-based cohorts (two to three according to the task), with differences in MRI acquisition protocol, image resolution and scanner characteristics across datasets. No additional information beyond the images was provided. Each of the datasets was enriched for lesion burden through stratified sampling of the skewed population distributions.
For each task, a similar approach to assessment was adopted to ensure consistency across tasks and address both segmentation and detection aspects, although some may currently be considered more important in one task than another, with different paradigms used in clinical practice. For instance, the blooming effect observed in the presence of microbleeds is protocol dependent, making the detection more relevant than the segmentation in that task (Buch et al., 2017).

Challenge organization
The Where is VALDO? challenge was run as a satellite event at MICCAI 2021 as a collaboration of University College London and Erasmus MC University Medical Center Rotterdam. Its three-task design was peer-reviewed prior to acceptance and made public at https://doi.org/10.5281/zenodo.4600654. Regarding prize eligibility, it was decided that organizers would not participate and that members of the same institutions as the organizers, while allowed to participate in the challenge, would not be eligible for prizes. Prizes were given to each winner of individual tasks and to the overall winner across all tasks. Results were publicly presented for all participating teams. All submitting teams were invited to propose two team members (per task) to participate as co-authors in the challenge overview paper. After publication of this overview paper of the challenge, the submission will reopen to the community for anyone wanting to benchmark their methods against those previously submitted. Further information is available on the challenge website https://valdo.grand-challenge.org.
The challenge was organized in 4 phases: 1) a training phase from the moment the annotated database was made downloadable (February 2021), 2 and 3) two optional validation steps on 5 new cases to provide individual (no public leaderboard) feedback on the performance (14th to 21st of June and 12th to 19th of July) and 4) the final evaluation stage on withheld cases (submission from 26th of July to 5th of August 2021). A grace period extending until the 10th of August in case of technical difficulties was granted to all participants. Participants had to provide a docker container for their fully automated method (1 for each task) and were allowed to participate in any or all the tasks. Use of additional training data was allowed under the condition it would be made available at submission time. The methods did not have to be similar across all tasks. Details of the submission procedure are listed at https://valdo.grand-challenge.org/Submission/. Participating teams were also requested to provide a short technical note describing their solutions; these notes have been made available at https://openreview.net/group?id=MICCAI.org/2021/Challenge/VALDO. Figure 2 presents the timeline of the challenge. Submitted methods were evaluated on the test set at a GPU facility at Erasmus MC. In order to ensure that the proposed methods were running as expected, each docker was run on one example of the training set and the result sent back to the participants for checking, allowing for submission of a new docker if the output was not as expected.
The evaluation code was made available prior to submission at https://github.com/WhereIsValdo/valdo-eval-2021. The participating teams were encouraged to make their source code publicly available and all participants except one team agreed for their docker containers to be made public. They have been placed on https://hub.docker.com/r/whereisvaldo/challenge2021/tags.
The challenge was sponsored by NVIDIA and Icometrix. Test data were available to CHS and KVW. The contribution of the authors listed in this manuscript can be found in supplementary material.

Community survey
To better understand the interest within the community for such an initiative, we launched in January 2021 a survey targeting the community working in the field of automated detection of CSVD lesions. This survey was sent to a list of researchers having recently published automated methods for detection or segmentation of one of the three lesion types considered in the Where is VALDO? challenge, the International Society of Vascular Behavioural and Cognitive Disorders (VasCog https://www.vas-cog.com), and the Medical Image Understanding and Analysis (MIUA miua@jiscmail.ac.uk) mailing list, and the survey was shared on social media by the challenge organizing team. Overall, 36 answers were recorded with 25 individuals indicating they were very likely or likely to participate. Among the respondents, 39% indicated being already actively working in the field of CSVD and 30% more generally in the neuroimaging field. Microbleed segmentation appeared as the most popular task in the survey with 15 respondents indicating they were highly likely to participate in this task against 10 for EPVS and 10 for lacunes. These answers helped shape the final challenge design, notably standardizing the evaluation of the different tasks and making the challenge overall more concise.

Challenge data sets
The challenge data sets (training, validation, and test sets) came from the same cohorts with a similar ratio between them across tasks. This ratio was also kept in the testing set.

Datasets and image acquisition
Two subsets of population cohorts were used for all three tasks and an additional one was further available for the microbleed detection/segmentation task, namely the SABRE and Rotterdam Scan Study (RSS) cohorts and the ALFA study respectively. All cohorts were retrospective studies for which local ethical approval had already been obtained from the National Research Ethics Service Committee, London-Fulham (14/LO/0108) for SABRE, the Population Research Act from the Ministry of Health for RSS and the Independent Ethics Committee Parc de Salut Mar Barcelona and registered at Clinicaltrials.gov (NCT01835717) for ALFA. For all datasets, acquisition of the data was performed by a trained radiographer according to a predefined research protocol. The training data for the Where is VALDO? challenge was made available under a CC BY-NC-SA license.

SABRE.
The Southall and Brent Revisited (SABRE) cohort is a population cohort of individuals residing in the two named boroughs of west London (UK) (Tillin et al., 2013). This tri-ethnic cohort was initially recruited in 1988 with the purpose of investigating metabolic and cardiovascular diseases across ethnicities. For their third clinical visit (2014-2018), life partners were also invited to take part and study participants underwent a brain MRI session on a Philips 3T scanner. Mean age in this cohort at time of acquisition was 72 years, ranging from 36 to 92.

RSS.
The Rotterdam Scan Study (RSS) (Ikram et al., 2015) is part of the larger Rotterdam Study (RS) (Ikram et al., 2020), a population-based study that aims to investigate chronic illness in the elderly. Started in 1995, the Rotterdam Scan Study initially concerned a selection of the RS but since 2005 brain MRI is part of the core protocol of the study. Individuals aged 45 and over without dementia are eligible for MRI and are followed up every 3-4 years. Since 2005, scanning has been performed on a 1.5T GE MRI scanner dedicated to the study.

ALFA.
The ALFA (Alzheimer's and Families) cohort is based on the ALFA registry that gathers details of relatives (generally offspring) of patients with Alzheimer's Disease, making up a cohort naturally enriched for genetic predisposition to AD. As described in the related protocol paper (Molinuevo et al., 2016), the ALFA cohort is composed of cognitively normal participants aged 45-74. Brain MRI sequences were acquired on a GE Discovery 3T scanner.
Table 1 summarizes the acquisition parameters for the different sequences across the studied cohorts.

Training, validation and testing data
For Task 1 - EPVS and Task 3 - Lacunes, imaging data consists of T1-weighted, T2-weighted and FLAIR images, with the latter two modalities rigidly registered to the T1 image using NiftyReg (Modat et al., 2014).
The number of cases proposed for training was chosen based on annotation availability and data policy for making a certain number of cases publicly available. For Task 1 - EPVS and Task 3 - Lacunes, the SABRE segmentation data was already available for a set of 16 cases with a high level of cerebrovascular damage. In comparison, for the RSS study, for which annotations were more widely available, data were selected to cover the variability in burden present in the study. They present close to a uniform distribution in burden, thereby limiting data skewness towards cases without any lesion. In all tasks, annotated cases were distributed across training and testing sets to follow approximately similar burden distributions. A ratio of 6:10 between training and testing data was chosen across all cohorts and tasks.

Annotation
Across the three cohorts, raters were all trained for their annotation task and had at least 3 years of professional experience in dealing with medical images. The segmentation was performed for all SABRE and ALFA cases using ITKSnap (Yushkevich et al., 2016). For the RSS cases a custom MeVisLab (Ritter et al., 2011) application was used. In all cases where two annotations were available, the average of the two annotations was used as reference.
Task 1 - Enlarged Perivascular Spaces. For Task 1, the annotation strategy differed between the SABRE and RSS cohorts. For identifying EPVS, the STRIVE criteria (Wardlaw et al., 2013) were used in the SABRE cohort, while in the RSS cohort, the UNIVRSE criteria (Adams et al., 2015) were used. These criteria are very similar, except for the fact that the UNIVRSE criteria only consider EPVS with a diameter between 1 and 3 mm, while the STRIVE criteria do not have a lower limit and consider any EPVS with a diameter up to 3 mm. In the SABRE cohort, EPVS over the whole brain image were annotated independently by two raters (CHS and LL) with a senior radiologist (BGA) confirming the segmentation of CHS. The three modalities were jointly used for the segmentation, which was assessed across the three axes.
For this dataset the annotation was provided in either of two forms: over the full brain or on only 5 randomly selected slabs of 5 mm. A mask was provided per case indicating the slabs that were annotated.
In the RSS cohort, EPVS were annotated with segmentations in limited axial slices for 6 cases of the training set and the full test set, while the remaining 28 cases of the training set were annotated with dots only by a team of trained annotators supervised by KVW, FD and MWV. EPVS were annotated in four brain regions: the mesencephalon, hippocampus, basal ganglia, and the centrum semi-ovale. The first two smaller regions were annotated entirely. For the latter two regions, only one fixed slice was annotated. For the cases with EPVS segmentations, additional slices of the basal ganglia and of the white matter were annotated; the depth of these axial slices was randomly chosen per case. A mask indicating which parts of the brain had been annotated was computed using parcellation outputs for each case.
For the training data made available to participants, the EPVS annotations were presented either just as counts (computed from the dots), per slice and per region, or as segmentations plus counts in the same areas. The masks indicating the annotated regions and slices per case were also provided.
Figure 3 illustrates the type of annotation masks that were provided to the participants.
Task 2 - Microbleeds. Different raters annotated each of the cohorts but followed very similar protocols. The BOMBS criteria (Cordonnier et al., 2009) were applied for the SABRE (RR under the supervision of HRJ) and ALFA cohorts (consensus of SI and LL under the supervision of FB) as described in (Ingala et al., 2020). A team of trained raters under the supervision of MWV applied the protocol described in (Vernooij et al., 2008) for RSS. Both identification protocols are in line with the STRIVE guidelines (Wardlaw et al., 2013) that indicate that microbleeds are areas of signal void of generally 2-5 mm in diameter but can be up to 10 mm.
Task 3 - Lacunes. Lacunes were identified using the STRIVE criteria (Wardlaw et al., 2013). Cerebellar lacunes were excluded because of assumed differences in the underlying pathology in this brain region (Sigurdsson et al., 2022). Any surrounding gliosis (the hyperintense rim visible on FLAIR sequences) was not included in the segmentation of the lacune. For the SABRE cohort, lacunes were identified at the same time as EPVS, simply being assigned another label in the segmentation, with the two raters (CHS, LL) performing the identification and segmentation independently. For the RSS cohort, lacunes were independently segmented for all cases by two raters, the pair of raters varying across the cases. In RSS, all cases of the training, validation and test sets indicated by radiological reads as containing at least one lacune were consistently annotated by one rater (TE) on a custom MeVisLab (Ritter et al., 2011) application. The second set of annotations was performed using ITKSnap (Yushkevich et al., 2016). PY annotated all cases of the training set. FW annotated the validation set as well as half of the test set. The remaining half of the test set was annotated by IFV.

Sources of annotation errors
In all tasks, possible annotation errors stem from several distinct sources: the identification of a target element, either because these elements are very small and easy to miss or because they may be difficult to distinguish from similarly appearing structures (mimics); the decision on the boundary of an object, which is probably more complex in a coarser resolution plane; and the use of the segmentation software (too large a brush, not considering all orientations for consistency, or not adequately using the zoom). In the case of EPVS, deciding whether a marker was "large enough" was also subjective, possibly leading to different detection levels.

Preprocessing
For all tasks, the preprocessing consisted of a rigid alignment of the images as indicated in section 2.4.2. A defacing mask derived from the T1-weighted image was applied to all registered modalities. While such a step would not be required in practice, it was mandated by the data sharing policies around public release of the data. The defacing mask was obtained as the inverse of a dilated version of the brain mask as obtained from HD-BET (Isensee et al., 2019). All RSS scans were corrected for intensity inhomogeneity with the default parameters of the MINC N3 package (Sled et al., 1998).

Assessment method
All three tasks were evaluated using similar metrics in order to assess both detection and segmentation performance of the proposed solutions. A combination of relative error (F1 score and Mean Dice score) and absolute error (absolute element difference (AED) and absolute volume difference (AVD)) metrics was chosen, since they provide complementary information. The F1 score and the AED on the number of detected lesions were chosen as detection metrics while the Mean Dice score over the appropriately identified elements and the AVD were the metrics used for the evaluation of segmentation. Table 3 summarizes the purpose, formula and properties of the metrics used in the challenge across all tasks and calculated for each case, where c refers to 6-neighborhood connected components, TP to true positives, FP to false positives, FN to false negatives, Ref to the reference annotation and Seg to the predicted segmentation.
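As a minimal sketch of how the four per-case metrics could be computed, the snippet below assumes that element matching has already produced TP, FP and FN counts and total volumes. The formulas follow the standard definitions (Table 3 remains the authoritative source); the `CaseCounts` container and all function names are illustrative assumptions, not the challenge's evaluation code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CaseCounts:
    """Per-case quantities; all field names are illustrative assumptions."""
    tp: int            # reference elements matched to a prediction
    fp: int            # predicted elements without a match
    fn: int            # reference elements without a match
    ref_volume: float  # total reference volume (e.g. in mm^3)
    seg_volume: float  # total predicted volume, same unit


def f1_score(c: CaseCounts) -> Optional[float]:
    """Standard F1 = 2TP / (2TP + FP + FN); None for an empty reference."""
    if c.tp + c.fn == 0:
        return None  # relative metrics are not applicable to empty cases
    return 2 * c.tp / (2 * c.tp + c.fp + c.fn)


def absolute_element_difference(c: CaseCounts) -> int:
    """|number of reference elements - number of predicted elements|."""
    return abs((c.tp + c.fn) - (c.tp + c.fp))


def absolute_volume_difference(c: CaseCounts) -> float:
    """|reference volume - predicted volume|."""
    return abs(c.ref_volume - c.seg_volume)


def dice(ref_voxels: set, seg_voxels: set) -> float:
    """Dice overlap between two elements given as sets of voxel coordinates."""
    total = len(ref_voxels) + len(seg_voxels)
    return 2 * len(ref_voxels & seg_voxels) / total if total else 0.0


def mean_dice(matched_pairs: list) -> Optional[float]:
    """Average Dice over matched (reference, prediction) element pairs."""
    if not matched_pairs:
        return None  # no true positive element, so nothing to average
    return sum(dice(r, s) for r, s in matched_pairs) / len(matched_pairs)
```

Returning `None` for inapplicable cases, rather than 0, keeps empty cases from being confused with failed detections when the per-case distributions are later compared.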
One essential aspect in the evaluation for the derivation of both F1 and Mean Dice score was the definition of true positive elements. To determine which of the elements were true positives, for all three tasks, connected components with a neighbourhood of 6 were established for both annotation and prediction using a threshold for the probability of 0.5 for the prediction map. Each annotation element was matched to at most one element from the prediction. For Task 1 - EPVS, a possible matchable element had to have an Intersection over Union (IoU) of more than 10%.
For Task 2 - Microbleeds and Task 3 - Lacunes, matching was possible when the centre of mass of the prediction element was less than 5 mm away from the centre of mass of the ground truth segmentation element. When multiple elements were found to be matchable, the one with the best association value (IoU or centre of mass distance) was attributed to the annotated label. For empty cases, the relative metrics were inapplicable, so only the absolute error metrics (number of elements and volume) were computed.
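The matching rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the official evaluation code: elements are represented as sets of integer (x, y, z) voxel coordinates, reference elements are processed in list order, and a prediction already assigned to one reference element is not reused (the text only states that each annotation element receives at most one prediction). All function names and the `task` labels are hypothetical.

```python
from collections import deque
from math import dist

# Face-sharing neighbours only, i.e. 6-connectivity in 3D.
NEIGHBOURS_6 = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def connected_components(voxels):
    """Split a set of (x, y, z) voxels into 6-connected components."""
    remaining = set(voxels)
    components = []
    while remaining:
        seed = remaining.pop()
        component, queue = {seed}, deque([seed])
        while queue:
            x, y, z = queue.popleft()
            for dx, dy, dz in NEIGHBOURS_6:
                neighbour = (x + dx, y + dy, z + dz)
                if neighbour in remaining:
                    remaining.remove(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


def iou(a, b):
    """Intersection over Union of two voxel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def centre_of_mass(component, spacing=(1.0, 1.0, 1.0)):
    """Centre of mass in mm, given the voxel spacing along each axis."""
    n = len(component)
    return tuple(spacing[i] * sum(v[i] for v in component) / n for i in range(3))


def match_elements(ref_comps, seg_comps, task, spacing=(1.0, 1.0, 1.0)):
    """Return {reference index: prediction index} for matched elements."""
    matches, used = {}, set()
    for r_idx, ref in enumerate(ref_comps):
        best, best_score = None, None
        for s_idx, seg in enumerate(seg_comps):
            if s_idx in used:  # assumption: one-to-one matching
                continue
            if task == "epvs":  # IoU criterion, more than 10%
                score = iou(ref, seg)
                matchable = score > 0.10
            else:  # centre of mass criterion, less than 5 mm
                d = dist(centre_of_mass(ref, spacing), centre_of_mass(seg, spacing))
                score, matchable = -d, d < 5.0
            if matchable and (best_score is None or score > best_score):
                best, best_score = s_idx, score
        if best is not None:
            matches[r_idx] = best
            used.add(best)
    return matches
```

With `matches` in hand, TP is `len(matches)`, FN is the number of unmatched reference elements and FP the number of unmatched predicted elements.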
In the event of algorithmic failure for a specific case, worst metric values were attributed. For bounded metrics (F1 and Mean Dice score) a value of 0 was given. For non-bounded error metrics (absolute element and absolute volume difference) an error of 100 000 was assigned as worst possible error.
For Task 3 - Lacunes, two metrics related to the estimation of uncertainty were further included. One was designed to tackle detection uncertainty and the other segmentation uncertainty. In terms of uncertainty validity, elements are considered as either truly certain (TC), truly uncertain (TU), falsely certain (FC) or falsely uncertain (FU) as per Table 4.
The uncertainty was calculated as (T [...]. The segmentation uncertainty was only assessed over true positive detected elements, assessing probabilistic uncertainty accuracy as [...].
All metrics were computed per image and the distribution over all cases of the test set was used for the final ranking. For each task, ranking of the methods was performed following the method described for the Medical Segmentation Decathlon challenge (Antonelli et al., 2021). Pairwise comparisons were performed using the Mann-Whitney U-test for the Mean Dice over cases with F1 > 0 and the Wilcoxon paired test for the other metrics due to their non-normal distribution.
For each method, the number of times it was found significantly better (with a p-value ≤ 0.05 for significance) than another was used to rank the given metric. The final rank was obtained as the average across the ranks (lower being better). The robustness of the ranking was further assessed using the distribution of Kendall's tau correlation coefficient between the ranking for all cases and the one obtained for 1000 bootstrap samples as described in (Wiesenfarth et al., 2021).
To identify the best overall team, the ranks were averaged across all common metrics of all tasks for the teams that provided a solution to all three tasks.
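The rank aggregation described above could be sketched as follows. The snippet assumes that the pairwise p-values have already been obtained from the statistical tests named in the text; how ties in the number of wins are handled is not stated, so the average-rank convention used here is an assumption, as are the function names and the placeholder team labels.

```python
def count_wins(teams, p_better, alpha=0.05):
    """Count, per team, how many other teams it significantly outperforms.

    p_better[(a, b)] is the p-value of the test that a performs better than b.
    """
    return {
        a: sum(1 for b in teams if b != a and p_better[(a, b)] <= alpha)
        for a in teams
    }


def rank_from_wins(wins):
    """More wins gives a better (lower) rank; tied teams share the mean rank."""
    ordered = sorted(wins, key=wins.get, reverse=True)
    ranks, i = {}, 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and wins[ordered[j + 1]] == wins[ordered[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[ordered[k]] = shared
        i = j + 1
    return ranks


def final_ranks(wins_per_metric):
    """Average the per-metric ranks; wins_per_metric maps metric -> {team: wins}."""
    per_metric = [rank_from_wins(w) for w in wins_per_metric.values()]
    teams = per_metric[0].keys()
    return {t: sum(r[t] for r in per_metric) / len(per_metric) for t in teams}


# Hypothetical example with three teams and two metrics.
example = final_ranks({
    "f1": {"A": 2, "B": 1, "C": 0},
    "mean_dice": {"A": 1, "B": 1, "C": 0},
})
```

The same `final_ranks` call extends to the overall winner by passing the metrics of all tasks at once, restricted to the teams that entered every task.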
Clinical performance. For each task, the most clinically relevant metric was further defined and used to compare the different methods. For Task 1 - EPVS, to emphasize the notion of burden of EPVS, the correlation between predicted and reference volumes across the population of test cases was used. For Task 2 - Microbleeds and Task 3 - Lacunes, where a binary statement of existence or absence is most clinically relevant, the balanced accuracy over cases considered as a whole-image classification task was chosen.
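For the whole-image classification view used for microbleeds and lacunes, balanced accuracy is the mean of sensitivity and specificity. The sketch below assumes each case is summarized by its number of reference and predicted elements, with "positive" meaning at least one element; the function name and input format are illustrative assumptions.

```python
def balanced_accuracy(ref_counts, pred_counts):
    """Balanced accuracy for presence/absence of at least one lesion per case."""
    if len(ref_counts) != len(pred_counts):
        raise ValueError("one reference and one prediction count per case")
    tp = fp = tn = fn = 0
    for ref_n, pred_n in zip(ref_counts, pred_counts):
        ref_pos, pred_pos = ref_n > 0, pred_n > 0
        if ref_pos and pred_pos:
            tp += 1
        elif ref_pos:
            fn += 1
        elif pred_pos:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both lesion-free and lesion-bearing cases are needed")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# A method that flags every case as positive scores 0.5 regardless of prevalence.
always_positive = balanced_accuracy([0, 0, 2, 1], [1, 1, 1, 1])
```

This behaviour is why balanced accuracy, unlike plain accuracy, penalizes methods that report a lesion in nearly every scan of a lesion-enriched test set.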
Cross-dataset performance. For each task, the performance of each method was additionally computed per dataset and then compared. The ranking was also computed per dataset to examine specific discrepancies between cohorts.
Regional performance. To assess whether the performance of the proposed methods differed depending on the region for Task 1 - EPVS, the evaluation was run for each region (centrum semi-ovale, basal ganglia, hippocampus and mesencephalon) separately. For each method, pairwise comparison across regions was performed to assess whether a given method performed better on a given area. The overall ranking between methods was also computed per region.
Inter-rater variability. For Task 1 - EPVS and Task 3 - Lacunes, for which annotations by two raters were available, the evaluation was run considering alternatively each rater as the reference. While the overall absolute differences (volume and number of identified components) between the two raters are independent of the reference chosen (rater 1 or rater 2), changing the reference will affect F1 score and Mean Dice calculation due to differences in definition of true positives.
Ensemble performance. Two ensemble solutions were created and evaluated: the average of all solutions (EnsembleAll) and the average of the predictions from the top 50% in overall rank of the methods (EnsembleTop). EnsembleAll and EnsembleTop were compared to the individual methods for each task. The number of participating teams being 4 for Task 1 - EPVS, EnsembleTop in this case consists of the union of the two best performing methods.
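A minimal sketch of the two ensembles is given below, operating on flattened prediction maps stored as plain lists. Two points are assumptions rather than facts from the text: the averaged map is binarized with an inclusive threshold (at or above 0.5), which is what makes the average of two binary maps equal to their union, and the "top 50%" is taken as the integer half of the number of teams. Function names are illustrative.

```python
def ensemble_average(prediction_maps):
    """Voxel-wise mean of flattened prediction maps of identical length."""
    length = len(prediction_maps[0])
    if any(len(m) != length for m in prediction_maps):
        raise ValueError("all prediction maps must have the same size")
    n = len(prediction_maps)
    return [sum(m[i] for m in prediction_maps) / n for i in range(length)]


def binarize(prob_map, threshold=0.5):
    """Inclusive threshold (assumption): values at or above 0.5 are foreground."""
    return [1 if p >= threshold else 0 for p in prob_map]


def top_half(overall_ranks):
    """Teams in the better half of the ranking (lower rank is better)."""
    ordered = sorted(overall_ranks, key=overall_ranks.get)
    return ordered[: max(1, len(ordered) // 2)]


# With two binary maps, averaging then thresholding at 0.5 yields their union.
method_a = [1, 0, 0, 1]
method_b = [1, 1, 0, 0]
ensemble_top = binarize(ensemble_average([method_a, method_b]))
```

With more than two methods the same code behaves as a majority-style vote, so EnsembleAll keeps a voxel only when at least half of the averaged probability mass supports it.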

Challenge submission and participating teams
Over the period of the challenge, the data set was requested for download 353 times. Across the two validation periods, we received requests from 1 team at validation stage 1 and 4 teams at validation stage 2. The final submission of dockerized solutions and their documented description to be applied to the test sets was composed of 4 teams for Task 1 - EPVS, 9 teams for Task 2 - Microbleeds and 6 teams for Task 3 - Lacunes. Only 2 teams participated in all 3 tasks. Table 5 summarizes in which task each team participated.
Table 6 reflects for each task and team the average time needed to evaluate one case, the GPU memory consumption, the docker details for memory requirements (CPU/GPU) and the methods' [...] Mask-RetinaNet (Farady et al., 2020), ResNet (He et al., 2016). Beyond the well-known Dice (Milletari et al., 2016) and binary cross-entropy losses, others such as focal loss (Lin et al., 2017) and blob loss (Kofler et al., 2022) were mentioned. Adam (Kingma and Ba, 2014), SGD (Gardner, 1984) and Ranger21 (Wright and Demeure, 2021) were the optimizers used. Among all the submissions, only one team (TheGPU) proposed an alternative to a deep learning solution. The majority of the proposed methods were trained as pure segmentation solutions and a few teams submitted a detection+segmentation solution based on Mask-RCNN (He et al., 2017) or Mask-RetinaNet (Farady et al., 2020). Across all tasks, when a deep learning solution was proposed, the UNet architecture was the most common choice. For all three tasks, the time required to process a case and the GPU memory requirements varied greatly. For Task 2 - Microbleeds, for instance, duration ranged from less than 1 minute to 45.8 min and memory consumption from 2.4 to 43 GB (allowing for memory flooding). In terms of the methodology for uncertainty assessments in Task 3 - Lacunes, the two teams submitting methods to all three tasks did not provide any uncertainty map. Among the 4 remaining teams, most directly used the probabilistic value of their output as a measure of uncertainty while mixLacune defined an uncertainty zone at the border of their detected lacunes.
For all teams, key characteristics of the proposed methods are summarized in Table 6. Additional details can be found for each team on the OpenReview repository https://openreview.net/group?id=MICCAI.org/2021/Challenge/VALDO.

Metric values
For each task the detection and the segmentation are reported across all teams.
Task 1 - Enlarged Perivascular Spaces (EPVS). The summary statistics for each team and each metric are reported in Table 7.
Task 3 - Lacunes. Table 9 presents the results obtained for Task 3 - Lacunes, while Table 10 shows the metrics for the uncertainty component of the task, excluding BigrBrain and TeamTea who did not provide an uncertainty map. Figure 6 presents the distribution of metric values for detection (top row) and segmentation metrics (bottom row) for Task 3 - Lacunes. Figure 7 shows the distribution of metric values for the assessment of uncertainty applied for Task 3 - Lacunes.

Rankings
Table 11 presents the overall ranking, according to the number of tasks undergone and for each individual task when relevant.Table 12 reflects the distribution of Kendall's Tau coefficient when assessing the robustness of

Clinically relevant markers
Task 1 -EPVS.For Task 1, since the burden of PVS is currently clinically considered the most valuable insight, the Spearman correlation coefficient between predicted and reference burden across all test cases was calculated for overall volume and element count and is presented in Figure 8 along with the log-transformed relationship between reference and predicted burden in terms of volume (top) and count (bottom).Task 2 -Microbleeds.For cerebral microbleeds, classifying the absence or presence of any microbleeds was deemed clinically the most relevant assessment.Balanced accuracy over the test set varied from 29.5% for team Dawai to 87.3% for team MixMicrobleed. Figure 9 presents the confusion matrices for each of the teams.
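For the Task 1 burden comparison, the Spearman coefficient is the Pearson correlation computed on ranks. The sketch below is a standard-library illustration with tied values sharing their average rank; the function names are assumptions and the input is simply one reference and one predicted burden value (volume or count) per test case.

```python
def average_ranks(values):
    """1-based ranks of the values, with ties sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(reference_burden, predicted_burden):
    """Spearman rank correlation between reference and predicted burden."""
    return pearson(average_ranks(reference_burden), average_ranks(predicted_burden))
```

Because ranks are unchanged by any strictly increasing transform, the coefficient is the same whether it is computed on raw or log-transformed burden values.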
Task 3 -Lacunes.Similarly, Figure 10 shows the confusion matrix for correctly identifying cases that have at least one lacune.For the 6 participating teams, balanced accuracy was close to 0.5 for almost all teams as they predicted the presence of at least one lacune in almost all cases.Only TeamTea was able to recognize cases without lacunes, with 78.3% balanced accuracy.
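To make the link between systematic overcall and a balanced accuracy near 50% concrete, the following minimal sketch computes balanced accuracy for the presence/absence classification from per-case lesion counts. The function name and the example counts are hypothetical and are not taken from the challenge data.

```python
def balanced_accuracy(reference_counts, predicted_counts):
    """Balanced accuracy for 'at least one lesion' vs 'no lesion' per case."""
    tp = tn = fp = fn = 0
    for ref, pred in zip(reference_counts, predicted_counts):
        ref_pos, pred_pos = ref > 0, pred > 0
        if ref_pos and pred_pos:
            tp += 1
        elif ref_pos:
            fn += 1
        elif pred_pos:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in the reference")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Hypothetical lesion counts per case: the method finds something in every case.
reference = [0, 0, 2, 1, 0, 3]
always_positive = [1, 4, 2, 1, 2, 5]
print(balanced_accuracy(reference, always_positive))  # 0.5
```

A method that flags every case reaches perfect sensitivity but zero specificity, which averages to 0.5 regardless of how many cases truly contain a lesion.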

Cross-dataset variability
Performance varied greatly across datasets and was systematically better on the RSS dataset than on the others (SABRE and ALFA). For all three tasks, Figure 11 presents the variation of F1 and Mean Dice score across datasets for all teams, and Table 13 presents the median and interquartile range of F1 score and Mean Dice for all tasks across datasets.
Rankings also varied slightly across datasets, as indicated in Table 14.

Inter-rater variability
Inter-rater variability was investigated for tasks and datasets for which two raters provided annotations for the same case (Task 1 - EPVS, SABRE dataset; Task 3 - Lacunes, all datasets), and results are presented in Table 15.
For Task 1 - EPVS, inter-rater detection performance was slightly lower than that of the best method, but inter-rater segmentation performance appeared to be better by a strong margin, reaching 59.49% in comparison to 45.5% for the best method. For Task 3 - Lacunes, inter-rater detection performance was notably higher, with segmentation performance on par with the best-performing method.
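For readers less familiar with overlap measures, the sketch below computes a plain Dice coefficient between two binary masks, as one would when comparing two raters on the same element. It is an illustration only: the element matching protocol behind the challenge's Mean Dice is not reproduced, and the convention for two empty masks is an assumption.

```python
def dice(mask_a, mask_b):
    """Dice overlap between two binary masks given as flat sequences of 0/1."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    # Convention (an assumption): two empty masks are treated as perfect agreement.
    return 1.0 if total == 0 else 2 * intersection / total


# Hypothetical flattened annotations of the same small lesion by two raters.
rater_1 = [0, 1, 1, 1, 0, 0]
rater_2 = [0, 0, 1, 1, 1, 0]
print(dice(rater_1, rater_2))  # 2 * 2 / (3 + 3)
```

For objects of only a few voxels, a shift of a single voxel already removes a third of the overlap in this example, which is one reason why segmentation agreement on small lesions tends to be modest even between human raters.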
Table 16 presents the values of the metrics and the corresponding ranking obtained for each type of ensemble (EnsembleAll, the average of all solutions, and EnsembleTop, the average of the top 50%) across the three tasks. When considering the clinical metrics, performance was higher for both ensemble solutions in Task 1 - EPVS, reaching a correlation coefficient of 70.0% and 74.8% for EnsembleAll and EnsembleTop respectively for the count, and 69.5% and 80.0% for the volume. For Task 2 - Microbleeds, balanced accuracy was 77.0% for EnsembleAll and 79.6% for EnsembleTop, ranking fourth and third respectively among all teams. Finally, for Task 3 - Lacunes, balanced accuracy reached 75.0% for EnsembleAll and dropped to 65.3% for EnsembleTop, slightly lower than the 78.0% obtained by TeamTea.
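The two ensemble types can be sketched as voxel-wise averages of the submitted probability maps. The snippet below is a minimal illustration under stated assumptions: maps are flattened lists of equal length, the ranking is given best first, and "top 50%" is taken as the first half of the ranking rounded up. Team names and values are hypothetical.

```python
def average_maps(maps):
    """Voxel-wise mean of several probability maps (flat lists of equal length)."""
    n_voxels = len(maps[0])
    if any(len(m) != n_voxels for m in maps):
        raise ValueError("all maps must have the same number of voxels")
    return [sum(values) / len(maps) for values in zip(*maps)]


def build_ensembles(maps_by_team, ranking):
    """Return (EnsembleAll, EnsembleTop) from per-team maps and a best-first ranking."""
    ensemble_all = average_maps([maps_by_team[team] for team in ranking])
    # Assumption: "top 50%" keeps the first half of the ranking, rounded up.
    n_top = (len(ranking) + 1) // 2
    ensemble_top = average_maps([maps_by_team[team] for team in ranking[:n_top]])
    return ensemble_all, ensemble_top


# Hypothetical probability maps over four voxels for four made-up teams.
maps_by_team = {
    "team_a": [1.0, 0.0, 1.0, 0.0],
    "team_b": [1.0, 0.0, 0.0, 0.0],
    "team_c": [0.0, 0.0, 1.0, 1.0],
    "team_d": [0.0, 0.0, 0.0, 1.0],
}
ranking = ["team_a", "team_b", "team_c", "team_d"]
ensemble_all, ensemble_top = build_ensembles(maps_by_team, ranking)
```

In this toy case the last voxel is only supported by the two lower-ranked teams, so it keeps a probability of 0.5 in the average of all solutions but vanishes from the average of the top half.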

Discussion
This manuscript reports the design and outcome of the "Where is VALDO?" challenge that took place as a satellite event of MICCAI 2021. Detection and segmentation of three types of markers of cerebral small vessel disease were evaluated as three distinct tasks, namely enlarged perivascular spaces (Task 1), cerebral microbleeds (Task 2) and lacunes (Task 3). Among the 12 distinct participating teams, 9 teams provided a solution for Task 2 and 2 teams competed across all three tasks.
Although the challenge was designed to address both detection and segmentation, most of the proposed solutions were designed for segmentation only, with detection performance considered a by-product of the prediction. This choice may have been partly influenced by the guidelines, which asked only for a probabilistic segmentation map that was then post-processed to identify the individual connected components, instead of requesting instance segmentation and predicted detections as outputs. However, this strategy appeared to work well in general, with segmentation performance being on par with detection performance across all three tasks. Interestingly, there was no strong relationship between memory, time expenditure and overall performance, with some of the most resource-intensive methods performing worse than some of the most cost-effective solutions.
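The post-processing step that turns a probabilistic map into individual detected elements can be sketched as thresholding followed by connected component labelling. The example below is a simplified illustration, not the challenge's evaluation code: it works on a 2D grid, and both the 0.5 threshold and the 4-connectivity are assumptions, whereas the challenge data are 3D volumes.

```python
from collections import deque


def connected_components(prob_map, threshold=0.5):
    """Label 4-connected components of a thresholded 2D probability map.

    Returns (labels, n_components), where labels is a 2D list with 0 for background.
    """
    n_rows, n_cols = len(prob_map), len(prob_map[0])
    labels = [[0] * n_cols for _ in range(n_rows)]
    n_components = 0
    for r in range(n_rows):
        for c in range(n_cols):
            if prob_map[r][c] < threshold or labels[r][c]:
                continue
            # New, unvisited foreground pixel: flood-fill its component.
            n_components += 1
            labels[r][c] = n_components
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < n_rows and 0 <= nx < n_cols
                    if inside and prob_map[ny][nx] >= threshold and not labels[ny][nx]:
                        labels[ny][nx] = n_components
                        queue.append((ny, nx))
    return labels, n_components


# Hypothetical 2D probability map containing two separate candidate lesions.
prob_map = [
    [0.9, 0.8, 0.1, 0.0],
    [0.2, 0.7, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.6],
    [0.0, 0.0, 0.3, 0.9],
]
labels, n_lesions = connected_components(prob_map)
sizes = [sum(row.count(k) for row in labels) for k in range(1, n_lesions + 1)]
```

Each labelled component then counts as one detected element, so a single segmentation output yields both the element count used for detection metrics and the per-element masks used for segmentation metrics.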
Across all tasks, one team proposed a solution not relying on deep learning, and their strategy had the best performance for Task 1 - EPVS, possibly because EPVS may be relatively easy to characterise in terms of signal and shape signature. However, none of the proposed methods for Task 1 - EPVS made use of the weak annotation data (count on slices). Also, while some methods only used annotated slices, performance may have been lowered by not using the masks when only specific parts of a given axial slice were annotated (RSS data). Most deep learning solutions described using a UNet-style architecture at some point in their pipeline, either as the main network for one-stage methods or as the segmentation component for multi-stage solutions.
Interestingly, despite four teams describing the use of the nnUNet (Isensee et al., 2021) architecture for Task 2 - Microbleeds, performance varied greatly across these teams, which ranked 1, 2, 5 and 7 out of 9. This could potentially be explained by the choice of input data, the dimensionality, or the framework chosen. In the context of microbleeds, using 3D information may be particularly relevant to avoid mimics. This observation highlights the importance of all these steps in the design of a relevant solution, the use of the whole extent of the training data being a key component of the winner's method. Such considerations are particularly relevant when dealing with a modest number of training examples. When considering choices of augmentation, those involving local changes to input images and/or reference annotations (interpolation, intensity changes, spatial deformation) may cause inconsistencies in the case of very small objects of interest.
In terms of dataset origins, performance was generally higher for the dataset with the highest resolution, which was also, for Task 2 - Microbleeds and Task 3 - Lacunes, the dataset with the highest number of training cases. This is expected, both as a direct effect of resolution on the evaluation metrics and as an overfitting-related effect.
The amount of training data (in terms of examples of lesions) also appeared to be relevant when comparing the performance of the methods of Task 1 across the different regions of interest, the regions with the most EPVS (centrum semi-ovale and basal ganglia) being the ones with the highest performance across all methods. The lower performance in the remaining regions (hippocampus and mesencephalon) may be due not only to the amount of training data available there, but also to the characteristics of the imaging sequences in these regions, the likelihood of mimics (cysts) and a higher variability in presentation. Knowledge of the differences in performance across regions is particularly interesting clinically, since associations with risk factors and/or clinical function have been made in specific anatomical regions in relation to Alzheimer's disease (Jiménez-Balado et al., 2018) and Parkinson's disease (Duker and Espay, 2007).
For Task 1 - EPVS, even for the best teams, performance showed a large variability, which would make adoption in clinical practice difficult. The overall good correlation between expected and predicted burden may, however, already be enough to make these tools valuable when investigating associations at population level.
For Task 2 - Microbleeds, it appeared that, when correctly detected, the segmentation of lesions was very good. However, even for the best teams, there were issues at the detection level, with both cases missed and cases wrongly considered as containing at least one microbleed. The best teams indicated very few lesions, which would be relatively practical to visually inspect and reject if necessary. Here, it is the absence of a systematic bias towards overcall or undercall that could make integration in clinical pipelines difficult.
For Task 3 - Lacunes, performance appeared quite poor on both detection and segmentation metrics, with a generally large overcall of lacunes and, when they were correctly detected, a lower segmentation performance than for Task 2 - Microbleeds. Such performance would require too much time for editing and checking to be adopted in either clinical practice or research studies.
When comparing performance across all three tasks, it appeared that performance was higher on tasks for which the variability in element appearance was lower (EPVS with linear shapes and microbleeds with spherical shapes, compared to lacunes with more heterogeneous shapes). The metrics investigated as closest to the current clinical measures of interest were generally in agreement with the overall ranking of the challenge but showed stark differences in terms of clinical viability of the proposed solutions. While for Task 1 - EPVS and Task 2 - Microbleeds the proposed solutions achieved reasonable performance in terms of "clinical" metric, only one team performed reasonably well for Task 3 - Lacunes, with all other solutions systematically finding many lacunes even when there were none. This may be due to the large variability in appearance (i.e. shape, location, intensity signature) as well as the lower number of examples of this type of lesion compared to those of Task 1 - EPVS and Task 2 - Microbleeds. With all solutions generally producing many false positives, the time required to go through each case and reject the many wrongly detected lesion candidates would be prohibitive for clinical adoption. One must, however, keep in mind that none of these solutions was optimized for this metric and that they may have performed differently otherwise. In this case, the addition of auxiliary tasks in the learning framework, either to abide by a priori knowledge of burden distribution or to directly optimize such metrics, may yield interesting results.
In a field where adequate research biomarkers have yet to be properly defined and proven to be reliable (Smith et al., 2019), these observations regarding clinical metrics may lead to the definition of different tasks and solutions for the targeted markers according to their purpose: clinical practice or research.
While location, individual volume and shape information may become of interest in the research context as potential new biomarkers, thereby highlighting segmentation as an interesting end-goal, these characteristics may not yet be relevant in the clinical context. In clinical practice, one could imagine a two-stage pipeline with 1) whole-image level classification favouring sensitivity, to flag scans where an assessment is required for the presence or absence of a specific marker, and 2) specification of lesion location (if needed) for the scans that have been flagged as containing a marker. This second step may be particularly relevant when supporting diagnosis (e.g., distinction between amyloid angiopathy and hypertensive pathology according to microbleed location) or explaining the clinical presentation (e.g., lacune on a crucial white matter tract).
A key aspect, not measured here, is the ability of the proposed methods to be used in clinical settings, with scans likely to be of lower resolution, to have more artefacts and to present other markers of pathology simultaneously (e.g. stroke, tumours). With the continuous progress in acquisition protocols and the democratization of scanning abilities, research-grade scanning protocols such as those used in this challenge may become routinely available, thereby limiting issues of protocol-related generalizability. However, cohort-related bias may be more difficult to overcome.
In fact, in the challenge, data came only from population cohorts and did not include patients with dementia as would be frequent in memory clinics.While efforts were made to provide training examples from the whole spectrum of lesion burden, specific pathological presentations may be missing and the generalizability of the proposed solutions would need to be assessed in these contexts.

Conclusion
In this challenge assessing the current segmentation and detection performance for three markers of cerebral small vessel disease, namely EPVS, microbleeds and lacunes, methods directly targeting segmentation were often quite successful in detecting these small structures. Number

Figure 2: Timeline of the challenge from inception in September 2019

Figure 3: Example of annotations provided for Task 1 - EPVS: (left) for SABRE, randomly selected slabs of 5 mm or full segmentation over the image; (middle) segmentation on two slices of the CSO, two slices of the basal ganglia, the hippocampi and the mesencephalon for 6 RSS cases; (right) count of EPVS on one slice of the CSO, one slice of the basal ganglia, the hippocampi and the mesencephalon for 28 RSS cases.

Participation of the teams across the different tasks
The memory details are presented both as requested by the participants based on their training settings and as measured on a single case allowing for memory flooding. All methods using Stochastic Gradient Descent (SGD) as optimizer applied Nesterov momentum with a value of 0.99. Poly learning rate scheduling is defined as multiplying the learning rate by $(1 - \mathrm{epoch}/\mathrm{epoch}_{\max})^{0.9}$. The following architectures were listed by the participating teams: 2D UNet (Ronneberger et al., 2015), 3D UNet (Çiçek et al., 2016), nnUNet (Isensee et al., 2021), Mask-RCNN (He et al., 2017), Mask-RetinaNet (Farady et al., 2020) and ResNet (He et al., 2016).

Figure 4: Distribution of metric values across the different teams for detection metrics (top row) and segmentation metrics (bottom row) for Task 1 - EPVS

Table 11: Ranking across all tasks, grouped by the number of tasks in which each team participated. Across all metrics, D refers to detection and S to segmentation, R to relative, A to absolute and U to uncertainty. DR refers to F1 score, DA to absolute element difference, SR to Mean Dice, SA to absolute volume difference, DU to detection uncertainty and SU to segmentation uncertainty. Tot is the overall rank for a given task.
Columns: Team; Task 1 - EPVS (DR, DA, SR, SA, Tot); Task 2 - Microbleeds (DR, DA, SR, SA, Tot); Task 3 - Lacunes (DR, DA, SR, SA, DU, SU, Tot).

Figure 6: Distribution of metric values across the different teams for detection metrics (top row) and segmentation metrics (bottom row) for Task 3 - Lacunes

Figure 8: Association between reference and predicted PVS burden across the participating teams for volume (top row) and count (bottom row). The Spearman rho (%) is indicated on each plot.

Figure 9: Confusion matrices for the classification of an image as containing at least one microbleed, based on the obtained prediction images.

Figure 11: Distribution of results for F1 (left column) and Mean Dice (right column) across different datasets for the three tasks (each row represents one task).

Figure 12: F1 and Mean Dice distribution across the different brain regions for Task 1 - EPVS
of elements on which to train the solutions was strongly predictive of performance, both across tasks and regionally. In the case of EPVS, manually engineered features proved relevant enough to compete with deep-learning-based strategies. Strikingly, all the presented methods relied on training with dense labels, discarding the weak labels available for Task 1 - EPVS. While some methods for Task 1 - EPVS and Task 2 - Microbleeds demonstrated that they could potentially be used for population-based research, the large variability in performance across cases may require lengthy visual censoring if they were to be used for individual cases. In this context, it could be relevant to further include the evaluation of performance variability in the assessed tasks. In addition, systematic assessment of prediction confidence (as proposed with the uncertainty metrics of Task 3 - Lacunes) would be of interest for the design of practical implementations.
Funding. The challenge prizes were provided by Nvidia and Icometrix. The SABRE study was funded at baseline by the Medical Research Council, Diabetes UK, and British Heart Foundation and at follow-up by the Wellcome Trust (082464/Z/07/Z), British Heart Foundation (SP/07/001/23603, PG/08/103, PG/12/29/29497 and CS/13/1/30327) and Diabetes UK (13/0004774). The Rotterdam Scan Study is supported by the Erasmus MC University Medical Center, the Erasmus University Rotterdam, the Netherlands Organization for Scientific Research (NWO) Grant 918-46-615, the Netherlands Organization for Health Research and Development (ZonMW), the Research Institute for Disease in the Elderly (RIDE), and the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement No. 601055, VPH-DARE@IT, the Dutch Technology
Maria González de Echavarri-Gómez, Oriol Grau-Rivera, Laura Hernandez, Gema Huesa, Jordi Huguet, Paula Marne, Marta Milà-Alomà, Tania Menchón, Carolina Minguillon, Arcadi Navarro, Grégory Operto, Eva M Palacios, Eleni Palpatzis, Cleofé Peña-Gómez, Albina Polo, Sandra Pradas, Blanca Rodríguez-Fernández, Aleix Sala-Vila, Gonzalo Sánchez-Benavides, Gemma Salvadó, Mahnaz Shekari, Anna Soteras, Laura Stankeviciute, Marc Suárez-Calvet, Marc Vilanova, Natalia Vilor-Tejedor.

Table 3: Description of detection and segmentation metrics used across all tasks for the evaluation.

Table 6: Details of the methods of the participating teams for each task.

Figure 4 presents the distribution of metric values for detection (top row) and segmentation metrics (bottom row) for Task 1 - EPVS.
Table 7: Metrics results for Task 1 - EPVS presented as median [1st quartile - 3rd quartile] for all metrics. AED - Absolute Element Difference; AVD (in mm³) - Absolute Volume Difference. In bold, the significantly best performance across the different teams (excluding the ensemble solutions), and in italic when there is no significant difference compared to the second best.
AED - Absolute Element Difference; AVD - Absolute Volume Difference (in mm³). In bold, the significantly best performance per metric across teams (excluding the ensemble solutions).
Task 2 - Microbleeds. Figure 5 presents the distribution of metric values for detection (top row) and segmentation metrics (bottom row) for Task 2 - Microbleeds, with Table 8 presenting the metric values across all teams.

Table 9: Metrics results for Task 3 - Lacunes presented as median [1st quartile ; 3rd quartile]. AED - Absolute Element Difference; AVD - Absolute Volume Difference (mm³). Bold font indicates best performance across the teams (excluding ensemble solutions) when significantly better than all others. Italic font indicates best performance when not significantly better than the second ranking

Table 12: Distribution (mean and standard deviation) of Kendall's Tau correlation coefficient in % between final ranking and bootstrap samples (1000 samples). Across all metrics, D refers to detection and S to segmentation, R to relative, A to absolute and U to uncertainty. DR refers to F1 score, DA to absolute element difference, SR to Mean Dice, SA to absolute volume difference, DU to detection uncertainty and SU to segmentation uncertainty.

Table 14: Ranking calculated for each dataset separately

Table 15: Results presented for the cases where a double rating was available in the test set.

Table 16: Metric values presented as median [IQR] for the 4 common metrics across the different ensemble types for the three tasks, along with associated ranking