Comparing methods of detecting and segmenting unruptured intracranial aneurysms on TOF-MRAs: The ADAM challenge

would therefore benefit from a reliable automatic UIA detection and segmentation method. The Aneurysm Detection And segMentation (ADAM) challenge was organised, in which methods for automatic UIA detection and segmentation were developed and submitted to be evaluated on a diverse clinical TOF-MRA dataset.
A training set (113 cases with a total of 129 UIAs) was released, each case including a TOF-MRA, a structural MR image (T1, T2 or FLAIR), annotation of any present UIA(s) and the centre voxel of the UIA(s). A test set of 141 cases (with 153 UIAs) was used for evaluation. Two tasks were proposed: (1) detection and (2) segmentation of UIAs on TOF-MRAs. Teams developed and submitted containerised methods to be evaluated on the test set. Task 1 was evaluated using sensitivity and false positive count. Task 2 was evaluated using Dice similarity coefficient, modified Hausdorff distance (95th percentile) and volumetric similarity. For each task, a ranking was made based on the average of the metrics.
In total, eleven teams participated in task 1 and nine of those teams participated in task 2. Task 1 was won by a method specifically designed for the detection task (i.e. not participating in task 2). Based on segmentation metrics, the top two methods for task 2 performed statistically significantly better than all other methods. The detection performance of the top-ranking methods was comparable to visual inspection for larger aneurysms. Segmentation performance of the top-ranking method, after selection of true UIAs, was similar to interobserver performance. The ADAM challenge remains open for new and improved submissions, with a live leaderboard to provide benchmarking for method developments at https://adam.isi.uu.nl/.

Introduction
Approximately 3% of the general world population have an unruptured intracranial aneurysm (UIA) (Vlak et al., 2011). In some risk groups UIAs are even more common, with a prevalence of approximately 10% in individuals with a positive family history of aneurysmal subarachnoid haemorrhage (aSAH) (Bor et al., 2014). Rupture of an intracranial aneurysm causes an aSAH, which is a severe type of stroke. Approximately one-third of patients die, and another third have long-term, life-changing disabilities (Keedy, 2006; Nieuwkamp et al., 2009). During screening, it is important that UIAs are detected early, to allow a treatment decision to be made. From diagnosis, the risk of growth and rupture of the UIA can be determined based on accurate measurement and assessment (Backes et al., 2017; Greving et al., 2014). If an aneurysm has a high risk of rupture, it will be treated preventively. Aneurysms with a lower rupture risk will be followed up with imaging and carefully monitored to assess aneurysm growth, which is an important determinant of aneurysm rupture (Backes et al., 2015). This allows informed treatment decisions to be made. Due to the increasing availability and quality of brain imaging, the number of incidentally discovered UIAs is increasing, and follow-up imaging is usually performed (Brown and Broderick, 2014; Nakagawa et al., 2019). Screening for UIAs with MRA is also increasing as knowledge of risk factors for UIA presence grows. Screening for UIAs with MRA has been shown to be cost-effective in persons with a positive family history of aSAH and in persons with autosomal dominant polycystic kidney disease (Bor et al., 2010; Flahault et al., 2018; Hopmans et al., 2016). The most common imaging techniques for monitoring UIAs are contrast-enhanced computed tomography angiography (CTA) and non-contrast 3D time-of-flight magnetic resonance angiography (TOF-MRA). TOF-MRA is well suited for routine follow-up imaging as it does not need contrast agent or radiation (Lane et al., 2015).
The detection and measurement of UIAs can be difficult, and it has been reported that approximately 10% of all UIAs are missed during screening (Forbes et al., 1996; Kim et al., 2017; Keedy, 2006; White et al., 2000). Detection is particularly difficult for small UIAs: detection by radiologists of UIAs < 5 mm on MRAs can have a sensitivity as low as 35% (White et al., 2001). However, detection by radiologists is improving as MRA scan resolution is increasing, especially with higher field strengths (HaiFeng et al., 2017; Wrede et al., 2017). In clinical practice, aneurysm detection is performed by a radiologist carefully searching through the axial slices of the TOF-MRA, often combined with coronal and sagittal multi-planar reconstructions, a maximum intensity projection (MIP) or 3D volume reconstruction, before making 2D size measurements of the aneurysm.
As more individuals are followed up or screened, the speed of clinical workflow could be increased with automatic methods of detection and quantification of UIAs from TOF-MRAs. However, it is important that these methods do not compromise the accuracy achieved by human observers for the detection and measurement of UIAs. Automated volumetric segmentation of UIAs would enable 3D quantification of UIAs and may aid the prediction of UIA rupture risk. For example, it is known that the shape of a UIA, such as a non-spherical or lobular shape, is related to an increase in growth and rupture risk (Backes et al., 2017; Lindgren et al., 2016; Raghavan et al., 2009). Furthermore, quantified shape measurements of the UIAs may aid models assessing treatment complication risk (Ji et al., 2016).
There are numerous methods for the (semi-)automatic detection and segmentation of UIAs. Semi-automatic methods include defining the neck of the aneurysm, where it attaches to the parent vessel, before segmenting the aneurysm (Cardenes et al., 2011). The shape of the aneurysm has been used in some UIA detection techniques, including blobness filters (Hentschke et al., 2012) and shape analysis of the surface of the vessel segmentations (Arimura et al., 2006; Bizjak et al., 2021; Lawonn et al., 2019). Furthermore, multiple deep learning techniques for UIA detection have been developed with high accuracy (Faron et al., 2019; Nishimori et al., 2018; Park et al., 2019). However, most methods are developed for CTA or 2D Digital Subtraction Angiography (DSA) images (Duan et al., 2019; Sulayman et al., 2016) and are for UIA detection only. The segmentation of UIAs is a difficult problem as UIAs can occur at many different locations and positions relative to the vessels. They are small and can vary greatly in shape and configuration. TOF-MRAs can also vary significantly between baseline and follow-up scans, due to the use of different scanners, protocols, field strengths and fields of view. This all leads to a basic requirement for accurate UIA detection and segmentation methods on TOF-MRA.
The Aneurysm Detection And segMentation (ADAM) Challenge described in this paper provides an overview of methods to fully automatically detect and segment UIAs from clinical TOF-MRA images (Timmins et al., 2020). The aim was to compare methods and assess the performance over clinical data from an in-house test set. Evaluation was performed by ranking the methods against each other, for both the detection and segmentation of UIAs, by determining detection and segmentation metrics. This paper provides an overview of the challenge including the organisation, the results, a detailed evaluation of methods submitted and their performance on the test data. This paper follows the structure outlined in the Biomedical Image Analysis challengeS (BIAS) guidelines for transparent reporting of biomedical image analysis challenges (Maier-Hein et al., 2020).

Challenge Organisation
The results of the ADAM Challenge 2020 were presented at the 23rd International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) on 8th October 2020. From 3rd April 2020, participants could register on the website (http://adam.isi.uu.nl/) to participate in the challenge. They could download a training dataset (for full details on the data, see Section 2.3) to train and develop fully automatic methods for the challenge. Participants were also allowed to use their own training data, as long as they referenced this in their method descriptions. Once trained, methods were containerised by participants with Docker (Merkel, 2014) and submitted to the organiser. Examples and instructions are provided on the website (http://adam.isi.uu.nl/methods/). The containerisation allowed easy evaluation of the methods, guaranteeing that they could be run on our platform. Submitted containers were run on an individual case from the training dataset, containing UIAs, and the results were sent back to the participant for verification. If technical issues or bugs occurred, teams were allowed to resubmit a new version with the bugs fixed.
The final verified, submitted methods were evaluated on a test set of images (see Section 2.3) using evaluation code that was made publicly available (https://github.com/hjkuijf/ADAMchallenge). If a method required it, NVIDIA Titan Xp GPUs were used for evaluation. The deadline for submission for consideration for the challenge leaderboard at MICCAI was 17th August 2020, and the results and awards were announced at the MICCAI conference (8th October 2020). However, the challenge remains open for submissions, with an up-to-date online leaderboard to allow for benchmarking of the methods. The ADAM challenge was advertised on the MICCAI website, various social media platforms, and via email to previous MRBrainS and WMH challenge participants (Kuijf et al., 2019; Mendrik et al., 2015).

Mission of the challenge
The ADAM Challenge consists of two tasks. Task 1 had the aim of automatic detection of UIAs on TOF-MRAs. Task 2 had the aim of automatic segmentation of UIAs on TOF-MRAs. Participants could submit to one or both tasks, and methods submitted to task 2 were also assessed for task 1. The target cohort is the patient group from which data would be acquired in the final application of the submitted methods. For the ADAM Challenge, the target cohort was any patient undergoing a clinical brain TOF-MRA to screen for the presence of a UIA. To reflect the clinical setting, some MRA scans were negative (i.e. a patient without any diagnosed UIAs) and some scans had more than one UIA. A patient in the target cohort may be scanned for the following reasons: (1) follow-up scans of patients with diagnosed UIA(s), with or without additional treated aneurysms; and (2) patients screened for positive family history of UIAs or aSAH. The challenge cohort is the patient group from which the challenge data was acquired, for both the training and the test datasets. The challenge cohort consists of a subset of patients, who had an available TOF-MRA, from a cohort of patients at the University Medical Center (UMC) Utrecht with at least one diagnosed UIA and from cohorts of persons screened for UIAs because of a positive family history of aSAH. The assessment aim of the challenge is to find a method that performs optimally for the automatic detection and segmentation of UIAs from the TOF-MRAs in the challenge cohort test dataset.

Challenge data sets
A total of 254 brain TOF-MRA scans were included with 282 untreated UIAs. The training dataset provided to participants consisted of 113 cases, while the test dataset consisted of 141 cases, where each case contained a TOF-MRA and a structural image (either T1-weighted, T2-weighted or FLAIR). All MRIs were performed at the UMC Utrecht, the Netherlands, on a variety of Philips scanners with a field strength of either 1, 1.5 or 3 T. The MRAs had an in-plane voxel spacing range of 0.195-1.04 mm and a slice thickness range of 0.4-0.7 mm, without a set acquisition protocol. This was due to the clinical nature of the data, which were taken from several studies across a long period of time (between 2001 and 2019). The subjects with UIAs (N = 53) had a median age of 55 years (range 24-75 years), with 75% of subjects being female. A subset (N = 156) of the dataset consists of paired scans of the same subject, a baseline scan and a follow-up scan more than 6 months later, to reflect real clinical data. The UIAs ranged in size, with a median maximum diameter of 3.6 mm and a range of 1.0-15.9 mm. 25% (N = 52) of the scans contained multiple UIAs and 28% (N = 59) of the scans contained treated (either coiled or clipped) UIAs. The median age of the population without UIAs was 41 years (range 19-61 years) and 65% were female. This reflects the clinical setting, as UIAs are more common in females and in older people (Vlak et al., 2011). The dataset was realistic and diverse, reflecting different standard clinical protocols used between MR scanners and over time.

Training and test data
Subjects were randomly split into training and test sets and it was ensured that both sets contained an adequate number of scans without any UIAs. Every case in the dataset contained one TOF-MRA and one structural (T1/T2/FLAIR) MR image of the same subject. The training dataset consisted of 113 cases: 93 cases containing at least one untreated UIA (35 baseline and 35 follow-up cases of the same subject and 23 cases of unique subjects) and 20 cases of subjects without UIAs. The test dataset consisted of 141 cases: 115 cases containing at least one untreated UIA (43 baseline and 43 follow-up cases of the same subject and 29 cases of unique subjects) and 26 cases of subjects without UIAs. The training data is available on the challenge website and requires registration and acceptance of our terms of distribution. An example of a provided training case can be seen in Fig. 1. A specific validation set was not provided and it was up to the participants to decide their own train/validation split. Statistical tests were performed to ensure that both training and test sets had a fair distribution of scans. An unpaired t-test was used to assess differences in age, maximum diameter, number of UIAs, number of treated UIAs, pixel spacing and slice spacing. Gender was assessed using Fisher's exact test, and the Chi-square test was used to assess location and magnetic field strength. The location categories used were: anterior cerebral or communicating artery (ACA/ACoA), internal carotid artery (ICA), posterior communicating artery (PCoA), middle cerebral artery (MCA) and posterior circulation.

Pre-processing
All images were pre-processed with N4 bias-field correction (Tustison et al., 2011). The structural image was aligned to the corresponding TOF-MRA using the elastix toolbox for image registration (Klein et al., 2010). The transformation parameters used were provided with the training data. Both original and pre-processed data were provided to the registered participants.

Annotation procedure
All UIAs were diagnosed on the scans as part of clinical routine. The UIAs were manually segmented from the original TOF-MRAs using in-house developed software implemented in MeVisLab (MeVis Medical Solutions AG, Bremen, Germany). A contour was drawn around the outline of the UIA, on all axial slices of the MRA. The parent vessel and any branching vessels were excluded from the annotation and annotations were always drawn starting from the UIA neck to the UIA dome. An experienced interventional neuro-radiologist (> 10 years of experience) trained a second rater with considerable experience in medical image analysis and annotation software, but not specifically UIAs. The trained second rater annotated all images in the dataset. Finally, the first and second rater assessed the full dataset together and made required modifications to the annotations in consensus to form the official ground truth data set. During annotation, the raters had access to the structural image and a radiologist report made at the time of the scan, indicating the location and size of the UIA. The same annotation procedure was performed for all treated UIAs, and these annotations were dilated to create a slightly larger mask for exclusion of treated aneurysms.
Figure caption (partially recovered; the description of panels a-c is missing): [...] (Klein et al., 2010), d) binary aneurysm image derived from annotations as described in Section 2.3.3. Bottom row: e) pre-processed TOF-MRA using N4 bias field correction (Tustison et al., 2011), f) structural MR image pre-processed using N4, g) pre-processed structural MR image aligned to TOF-MRA using pre-determined registration parameters.
The resulting annotations were converted to binary masks and voxels were considered part of the UIA if they were > 50% inside the contour. Untreated UIAs were given the label 1, treated UIAs label 2 and background was labelled 0. From the binary image, the centre of mass and maximum diameter of each of the untreated UIAs were determined in voxel coordinates in the corresponding TOF-MRA image space. This was provided in a text file for each training case.

Assessment method

Metrics and ranking
Task 1 and task 2 were evaluated separately using different metrics. All submitted methods for task 2 were also evaluated for task 1, where the centre of mass of 3D connected components in the image was used to determine the detection metrics.
For task 1, methods were evaluated by determining two detection metrics: (1) Sensitivity and (2) False Positive Count (the total number of false positives per scan). Sensitivity measures the proportion of true UIAs that are detected, ensuring that methods are optimised to detect as many UIAs as possible. The false positive count balances the sensitivity, ensuring that methods do not report too many falsely identified UIAs, which would not aid the radiologist.
For task 2, methods were evaluated by determining three segmentation metrics: (1) Dice Similarity Coefficient (DSC), (2) Modified Hausdorff Distance (MHD) (95th percentile) and (3) Volumetric Similarity (VS) (Taha and Hanbury, 2015). DSC describes how much the prediction and ground truth segmentations overlap. If there was no detection of UIAs, then the DSC was zero. MHD is a distance metric which is sensitive to the shape of the segmentation. This is important when segmenting UIAs as the shape may be used to assess rupture risk. MHD was only calculated where there was any detection of UIAs by the method; if there was no detection, it was ignored. VS assesses the similarity in volume of the predicted and ground truth segmentation. Accurate volume segmentation is important for UIAs for growth assessment.
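As an illustration, the two overlap-based metrics can be sketched in Python using the standard definitions DSC = 2|P ∩ T| / (|P| + |T|) and VS = 1 - ||P| - |T|| / (|P| + |T|), where P and T are the predicted and ground truth voxel sets. This is a minimal sketch and not the official evaluation code: representing a segmentation as a set of voxel coordinates, the function names and the example voxels are assumptions made for the illustration, and MHD is not covered.

```python
def dice_similarity(pred, truth):
    """DSC = 2|P ∩ T| / (|P| + |T|) for two sets of voxel coordinates."""
    total = len(pred) + len(truth)
    if total == 0:
        return float("nan")  # undefined when both segmentations are empty
    return 2.0 * len(pred & truth) / total


def volumetric_similarity(pred, truth):
    """VS = 1 - | |P| - |T| | / (|P| + |T|); it ignores where the voxels are."""
    total = len(pred) + len(truth)
    if total == 0:
        return float("nan")
    return 1.0 - abs(len(pred) - len(truth)) / total


# Hypothetical example: 4 ground truth voxels, 3 predicted voxels, 2 shared.
truth = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)}
pred = {(0, 0, 1), (0, 1, 1), (0, 2, 1)}

print(dice_similarity(pred, truth))        # 2*2 / (3+4) = 4/7
print(volumetric_similarity(pred, truth))  # 1 - 1/7 = 6/7
print(dice_similarity(set(), truth))       # no detection gives a DSC of zero
```

The last call reproduces the rule stated above that a scan without any detection receives a DSC of zero.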
Individual UIAs were defined as 3D connected components. A detection was considered positive when the predicted coordinate was within the maximum diameter of the location of the centre of mass of the ground truth UIA.
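The detection criterion can be sketched as follows. This is an illustrative reading of the rule above and not the official evaluation code: it assumes Euclidean distance in voxel coordinates, a threshold equal to the maximum diameter, that several predictions matching the same UIA count as a single detection, and that any prediction matching no UIA is a false positive. The function name and the coordinates are hypothetical.

```python
import math


def evaluate_detection(predicted, ground_truth):
    """Return (sensitivity, false_positive_count) for one scan.

    predicted:    list of (x, y, z) predicted coordinates
    ground_truth: list of ((x, y, z) centre of mass, maximum diameter)
    """
    detected = set()
    false_positives = 0
    for point in predicted:
        # Ground truth UIAs whose centre lies within one maximum diameter.
        hits = [i for i, (centre, diameter) in enumerate(ground_truth)
                if math.dist(point, centre) <= diameter]
        if hits:
            detected.update(hits)
        else:
            false_positives += 1
    if not ground_truth:
        return float("nan"), false_positives  # sensitivity undefined
    return len(detected) / len(ground_truth), false_positives


# Hypothetical scan with two UIAs and three predicted locations.
ground_truth = [((10, 10, 10), 4.0), ((30, 30, 30), 2.0)]
predicted = [(11, 10, 10), (50, 50, 50), (30, 30, 33)]

sensitivity, fp_count = evaluate_detection(predicted, ground_truth)
print(sensitivity, fp_count)  # 0.5 2
```

In this example the third prediction lies 3 voxels from a UIA with a maximum diameter of 2, so it counts as a false positive and the second UIA is missed.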
A similar ranking was performed for both tasks. Teams were ranked per metric, and the per-metric ranks were averaged to obtain the overall ranking per task. For each team, each metric was averaged over all test scans containing UIAs, except for the false positive count, which was evaluated over all test scans, independent of UIA presence. Next, for each average metric, the participating teams were ordered from best to worst. The metrics were scaled linearly to a number between 0 (corresponding to the best team) and 1 (worst team) and then averaged to obtain a single 'rank'. For task 1 the two detection ranks were averaged, and for task 2 the three segmentation ranks were averaged. For task 2, average interobserver segmentation metrics were also determined, based on measurements made by two separate observers on a subset of the scans.
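The ranking procedure can be sketched as follows. This is a minimal illustration of the linear scaling and averaging described above, not the official ranking code: the team names and metric values are invented for the example, and assigning a rank of 0 to every team when all teams tie on a metric is an assumption.

```python
def scaled_rank(values, higher_is_better):
    """Scale each team's average metric linearly to [0, 1]: 0 = best, 1 = worst."""
    best = max(values.values()) if higher_is_better else min(values.values())
    worst = min(values.values()) if higher_is_better else max(values.values())
    if best == worst:
        return {team: 0.0 for team in values}  # assumption: ties all rank 0
    return {team: (best - v) / (best - worst) for team, v in values.items()}


def overall_rank(metric_tables):
    """Average the per-metric ranks; metric_tables is a list of
    (values per team, higher_is_better) pairs."""
    per_metric = [scaled_rank(values, hib) for values, hib in metric_tables]
    teams = per_metric[0].keys()
    return {t: sum(r[t] for r in per_metric) / len(per_metric) for t in teams}


# Hypothetical task 1 averages (not challenge results).
sensitivity = {"team_a": 0.7, "team_b": 0.6, "team_c": 0.5}
fp_count = {"team_a": 0.4, "team_b": 0.2, "team_c": 1.0}

ranks = overall_rank([(sensitivity, True), (fp_count, False)])
leaderboard = sorted(ranks, key=ranks.get)
print(leaderboard)  # ['team_a', 'team_b', 'team_c']
```

Here team_a has the best sensitivity (rank 0) and an intermediate false positive count (rank 0.25), giving an overall rank of 0.125, ahead of team_b (0.25) and team_c (1.0).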

Further analyses
To evaluate the performance and approach of each method, more analyses were performed beyond the ranking procedure. In this way, we could determine if there were particular factors that affected the results including both the method approaches and the data characteristics. This included investigating the different method approaches, UIA size dependence, intra-subject variance and assessing train vs test performance.

Method analyses.
Based on the ranking, each method was examined in detail to characterise similarities and differences in performance. This was done to investigate whether some methods performed significantly better than others and whether method design had an influence on performance. Bootstrapping was performed to compute 95% confidence intervals for each metric and ranking for each team: 2,000 random samples were taken from the test set with replacement. If confidence intervals did not overlap, methods were considered to have significantly different performance. Furthermore, the STAPLE algorithm (Warfield et al., 2004) was used to ensemble, first, all of the segmentations from each method and, second, the segmentations from the top 3 teams in task 2. Segmentation metrics and rankings were determined for these STAPLE ensemble results and compared to the individual team performances.
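The bootstrapping step can be sketched as follows for a single metric. This is an illustration under stated assumptions, not the analysis code used for the challenge: the percentile method for the interval, the fixed random seed and the per-scan DSC values are all assumptions made for the example.

```python
import random
import statistics


def bootstrap_ci(per_scan_values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of a metric."""
    rng = random.Random(seed)
    n = len(per_scan_values)
    # Resample the scans with replacement and record the mean each time.
    means = sorted(
        statistics.fmean(rng.choices(per_scan_values, k=n))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def intervals_overlap(a, b):
    """True if the closed intervals a = (lo, hi) and b = (lo, hi) overlap."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical per-scan DSC values for one team.
dsc_values = [0.0, 0.42, 0.55, 0.61, 0.0, 0.73, 0.38, 0.66, 0.51, 0.47]
ci = bootstrap_ci(dsc_values)
print(ci)
```

Two teams would then be considered significantly different when `intervals_overlap` returns `False` for their intervals.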

Segmentation performance of true UIAs.
To assess the segmentation performance of the methods, the segmentation metrics were determined for only the true detected UIAs, excluding any false positives. This was done to imitate how the tool could be used in clinical practice, as a radiologist will only select a correctly detected UIA for segmentation. To create a similar scenario, it was first assessed whether the predicted segmentation overlapped with the ground truth segmentation. Connected component analysis was performed on the predicted segmentation. If a connected component overlapped with the ground truth segmentation, it was kept, and all other connected components (false positives) were removed. Segmentation metrics were determined for the remaining connected component relating to the true UIA. This was performed for each predicted segmentation by each team, and a mean of the metrics and a ranking were determined for each team.
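The selection of true UIAs can be sketched as follows. This is a pure-Python illustration rather than the code used in the challenge: segmentations are represented as sets of voxel coordinates, 26-connectivity is assumed (the connectivity is not stated above), and the function names and voxels are hypothetical.

```python
from collections import deque
from itertools import product

# Assumption: voxels touching by face, edge or corner are connected (26-connectivity).
NEIGHBOURS = [o for o in product((-1, 0, 1), repeat=3) if o != (0, 0, 0)]


def connected_components(voxels):
    """Split a set of (x, y, z) voxels into 3D connected components."""
    remaining = set(voxels)
    components = []
    while remaining:
        seed = remaining.pop()
        component = {seed}
        queue = deque([seed])
        while queue:  # breadth-first flood fill from the seed voxel
            x, y, z = queue.popleft()
            for dx, dy, dz in NEIGHBOURS:
                neighbour = (x + dx, y + dy, z + dz)
                if neighbour in remaining:
                    remaining.remove(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


def keep_true_components(predicted, truth):
    """Keep only predicted components that overlap the ground truth."""
    kept = set()
    for component in connected_components(predicted):
        if component & truth:
            kept |= component
    return kept


# Hypothetical prediction: one component on the UIA, one false positive.
truth = {(5, 5, 5), (5, 5, 6)}
predicted = {(5, 5, 6), (5, 5, 7), (20, 20, 20), (20, 21, 20)}
filtered = keep_true_components(predicted, truth)
print(sorted(filtered))  # [(5, 5, 6), (5, 5, 7)]
```

The segmentation metrics would then be computed on `filtered` instead of the full prediction.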

Detection performance on negative scans.
When screening for UIAs, some scans will be negative if a patient does not have UIAs. A well performing method should have a low false positive rate on the negative scans, as no true UIAs exist in these scans. Twenty-six scans of the test set did not have any UIAs, and the performance of each method on these scans was evaluated by determining the average false positive count. The average false positive count in negative scans was compared to the average false positive count in all scans in the test set containing true positives.

Size of UIAs.
The size of an aneurysm was expected to affect the performance of methods, as it is known that detection rates from visual inspection are lower for smaller aneurysms. The relationship between the size of the UIAs and the detection and segmentation performance was therefore investigated. Both sensitivity and DSC were assessed for each team in four size quartiles based on the maximum UIA diameter.

Intra-subject analyses.
Both the training and test data contained a subset of baseline and follow-up scans of the same subject. As this is common in clinical practice, it is vital that a measurement method should perform to a similar standard for both baseline and follow-up imaging, even though the two scans may differ in scanner type, acquisition protocol and quality. An accurate measure of the volume difference between follow-up and baseline scans is important to be able to detect growth of the UIA. To assess if the method could detect growth, the difference in volume between baseline and follow-up ground truth segmentations was determined (ground truth volume change). This was compared to the difference in volume of follow-up and baseline predicted segmentations by each method (predicted volume change). These measurements were only assessed for detected true UIAs, where the UIA was detected on both baseline and follow-up scan by the method. Similarities between the two volume change measurements indicate how reliable the measurement of the method is and this was assessed using Kendall's rank correlation measure ( Kendall, 1938 ). Kendall's tau indicates how well two values correspond, where 1 indicates a strong agreement, 0 indicates no association and -1 indicates a strong disagreement.
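The agreement between the two volume change measurements can be sketched as follows. This is a minimal illustration, not the analysis code: it implements the tau-a variant without tie correction (which variant was used is not stated above), and the volume changes are invented for the example.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1   # both measurements order the pair the same way
        elif sign < 0:
            discordant += 1   # the measurements order the pair oppositely
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical volume changes (follow-up minus baseline) for four UIAs.
truth_change = [2.0, -1.0, 10.0, 0.5]
predicted_change = [1.5, 0.0, 7.0, -0.5]

tau = kendall_tau_a(truth_change, predicted_change)
print(tau)  # 5 concordant and 1 discordant pair out of 6: (5 - 1) / 6
```

In this example only the pair formed by the second and fourth UIA is ordered differently by the two measurements, giving a tau of about 0.67.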
Furthermore, a method that performs well, and to the same standard, in both baseline and follow-up scans is required. The intra-subject performance of each team was investigated by comparing the evaluation metric for the baseline scan to the metric at the follow-up scan. A Wilcoxon signed-rank test was used to compare the two values for each team. This was performed for sensitivity, to assess detection performance, and for DSC and volumetric similarity, to assess segmentation performance.

Train vs test performance.
To assess performance differences between the training and test data, all methods were re-run on the training set and detection and segmentation metrics were determined. Performance should be similar to that on the test set; a much higher performance on the training set indicates that the method may not generalise well to unseen data. A similar ranking of methods was made and this performance was compared to the performance of the methods on the test set.

Training and test data
There were no statistically significant differences between the cases of the training and test datasets in age (p = 0.20), sex (p = 1), maximum diameter of the UIA (p = 0.58), number of UIAs (p = 0.32), number of treated UIAs (p = 0.45), magnetic field strength of the scanner (p = 0.11), in-plane voxel spacing of the scan (p = 0.43) or slice thickness of the scan (p = 0.78).

Challenge submission
Over 250 users registered for the challenge on the website, and 11 teams submitted methods. Two teams submitted only under task 1, for the detection of UIAs, and nine teams submitted under task 2, for the segmentation of UIAs. Results, presentations, posters and a brief description of all submitted methods can be found on the challenge website (http://adam.isi.uu.nl/results/results-miccai-2020/). The inference code submitted in Docker containers for the challenge is also available for most methods on DockerHub (https://hub.docker.com/orgs/adamchallenge).

Task 1 Submissions
MiBaumgartner submitted a 3D neural network based on the Retina U-Net architecture (Jaeger et al., 2018). The decoder was extended to incorporate semantic segmentation information and followed by a Path Aggregation Network (Liu et al., 2018) to generate the features used for the detection prediction (Baumgartner et al., 2020).
Unil_chuv submitted a 3D U-Net (Ronneberger et al., 2015) which was patch trained using patches selected based on landmark points from a registered vessel atlas (Mouches and Forkert, 2019). Both the ADAM dataset and an in-house dataset were used for training. On inference, patches were evaluated only if they were within a set distance from the registered landmark points and had a minimum intensity. A maximum of four false positives were allowed, based on the average brightness of the connected components (Di Noto et al., 2021, 2020).

Task 2 submissions
IBBM submitted a 2D convolutional neural network with TriWinged-Net architecture based on the BtrflyNet (Sekuboyina et al., 2018). MIPs of the MRAs were made in all three orientations (axial, coronal and sagittal) with each view as a different input branch. These are encoded separately before being concatenated in the centre of the network. From this, there were three corresponding decoding branches, to provide segmentation masks for each view which were, finally, recombined to form the full segmentation volume (Shit et al., 2020).
Inteneural submitted a method including three 2D neural networks with U-Net architecture based on EfficientNet (Tan and Le, 2019) that were pre-trained using ImageNet (Fei-Fei et al., 2010). Each network was fine-tuned for one axis (axial, coronal and sagittal) with 2 input channels: raw TOF signal and blood vessel segmentation, which was performed using the Jerman filter (Jerman et al., 2015). A loss function including both a generalised Dice loss (Sudre et al., 2017) and boundary loss (Kervadec et al., 2021) was used. The final prediction was determined as an average of the evaluated models' outputs (Walińska et al.).

Table 1a. Task 1: Average metrics and ranking for each team, with the lowest (best) rank placing highest in the table. Each value is provided as a mean of all scans (95% confidence interval, determined using bootstrapping). The dotted lines indicate groups of methods that can be considered to have statistically different ranking from the other groups as their 95% ranking confidence intervals do not overlap.
Joker submitted a 3D fully-convolutional neural network based on no new U-Net (nnUNet) (Isensee et al., 2021). Group Normalisation (Wu and He, 2018) was used instead of Batch Normalisation and leaky ReLU was used. A Dice ranking loss was used for training. Predictions were made by four separately trained models and ensembled using majority voting (Yang et al., 2020).
JunMa submitted a 3D fully-convolutional neural network based on no new U-Net (nnUNet) (Isensee et al., 2021). Networks were trained using five-fold cross validation and two different loss functions: Dice loss with cross entropy, and Dice loss with topK loss (Berrada et al., 2018), because the two losses have been proven to be robust on highly imbalanced segmentation tasks (Ma et al., 2021). At prediction, the five models with optimum performance were ensembled (Ma, 2020; Ma and An, 2020).
Kubiac submitted an ensemble of 18 neural networks with three network variants: a two-path dual-resolution fully convolutional neural network and two U-Net (Ronneberger et al., 2015) style architectures with two paths including contextual information in both the encoding and decoding path (Hilbert et al., 2020), trained on different loss functions. The loss functions were the sum of cross entropy, (generalised) Dice loss (Sudre et al., 2017) and boundary loss (Kervadec et al., 2021) (De Feo et al., 2020).
Stronger submitted an ensemble method of three models, where each model included a segmentation and a classification stage. The segmentation stage was based on a patch-trained 3D U-Net. The classification stage consisted of a 3D convolutional neural network to distinguish between true and false positives (Hu et al., 2020).
TUM-IBBM submitted a U-Net based architecture with MRA and aligned structural image as different input channels (Li et al., 2018). Two networks were trained on sagittal and coronal slices and, during testing, voxelwise predictions of both models were averaged (Loehr et al., 2020).
Xlim submitted a hybrid two-input neural network: one input for 3D patches and the second for the corresponding maximum intensity projection of the patches (Nakao et al., 2018). The two paths are brought together with a final concatenation layer. The patches consist of vessels only, segmented from the MRAs using an intensity and morphological transform based method (Rjiba et al., 2020).
Zelosmediacorp submitted a 3D fully convolutional neural network with a U-Net like architecture (Ronneberger et al., 2015) trained on patches centred on the average UIA position. Twelve networks were trained on four different training and validation splits, and the best of four networks were selected to form an ensemble that averaged the outputs of each network on the test set. Monte-Carlo dropout (Wang and Manning, 2013) was used for both training and inference (Giroud and Dubost, 2020).
Further, more in-depth descriptions of each method can be found on the website (http://adam.isi.uu.nl/results/results-miccai-2020/).

Metrics and rankings
The mean performance of each participating team for task 1 is shown in Table 1a and for task 2 in Table 1b. The dotted lines indicate groups of methods that can be considered to have statistically different ranking from the other groups as their 95% ranking confidence intervals do not overlap. Figs. 2 and 3 show bar charts and boxplots of the distribution of metrics for each team. For task 1, the method of xlim performed best for sensitivity and the method of IBBM performed best for false positive count. Based on the overall ranking (equal weighting of both metrics), mibaumgartner performed best for task 1. For task 1, mibaumgartner, joker, junma and kubiac had overlapping bootstrapped confidence intervals for rank and were thus considered not to differ substantially in performance. For task 2, junma had the best DSC and VS and joker had the best MHD. Based on the overall ranking (equal weighting of all three segmentation metrics), junma performed best for task 2. For task 2, junma and joker performed statistically significantly better than all other methods, as their bootstrapped confidence intervals did not overlap with those of any other method. The bottom row of Table 1b indicates the interobserver agreement of two observers. This was assessed as a mean over 144 scans (72 paired baseline-follow-up scans). The average interobserver metrics are much better than those of any submitted method. An example segmentation of team junma can be seen in Appendix A, Fig. 1.
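The bootstrapped confidence intervals and the overlap criterion described above can be illustrated with a minimal sketch. It assumes a simple percentile bootstrap on per-scan metric values; the challenge itself bootstrapped the ranking, and the function names, the number of resamples and the example values below are hypothetical.

```python
import random
import statistics


def bootstrap_mean_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-scan values."""
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(n_resamples):
        # Resample the scans with replacement and record the resampled mean.
        sample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(statistics.fmean(sample))
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(values), lower, upper


def intervals_overlap(ci_a, ci_b):
    """True when two (lower, upper) intervals share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]


# Hypothetical per-scan DSC values for two imaginary methods (not challenge data).
method_a = [0.71, 0.64, 0.80, 0.55, 0.69, 0.73, 0.62, 0.77]
method_b = [0.12, 0.30, 0.05, 0.22, 0.18, 0.09, 0.27, 0.15]

mean_a, lo_a, hi_a = bootstrap_mean_ci(method_a)
mean_b, lo_b, hi_b = bootstrap_mean_ci(method_b)
print(f"A: {mean_a:.2f} ({lo_a:.2f}, {hi_a:.2f})")
print(f"B: {mean_b:.2f} ({lo_b:.2f}, {hi_b:.2f})")
print("intervals overlap:", intervals_overlap((lo_a, hi_a), (lo_b, hi_b)))
```

Non-overlapping intervals are a conservative criterion: two methods whose intervals overlap slightly may still differ under a paired test.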

Method analyses
All 11 submissions for both tasks used deep learning techniques for the detection and/or segmentation of the UIA, and information about the methods is provided in Table 2. The U-Net ( Ronneberger et al., 2015 ) was the most common architecture, with 73% (8/11) of submissions using a U-Net style architecture for at least part of their method. The top two ranking segmentation methods used nnU-Net ( Isensee et al., 2021 ) as the base for their approach. Seven methods used 3D approaches, including the top 5 ranking methods. All methods incorporated the Dice loss in their loss function for training; however, junma and joker, the top-ranking segmentation methods, also incorporated topK loss ( Berrada et al., 2018 ). Ensembles were commonly used and appeared to boost performance, with the top 5 methods for tasks 1 and 2 using an ensemble. Ensembles were constructed in various ways, for example by combining models trained with different validation splits, different loss functions or different architectures. Unil_chuv was the only team to use an external, in-house dataset for training. 8/11 teams used augmentation of the training data and 7/11 teams used post-processing techniques to reduce the number of false positives.

Table 1b
Task 2: Average metrics and ranking for each team; the brackets contain the 95% confidence interval determined using bootstrapping. The dotted lines indicate groups of methods that can be considered to have statistically different ranking from the other groups as their 95% ranking confidence intervals do not overlap. STAPLE (all) and STAPLE (top-3) are the average metrics and ranking of the segmentation from the STAPLE algorithm of all and the top-3 methods, respectively. Interobserver are the metrics comparing manual segmentations of two different observers on a subset of the scans. DSC: Dice Similarity Coefficient; MHD: Modified Hausdorff Distance; VS: Volumetric Similarity.

Table 2
Submitted methods sorted on their final ranking per task, with highest placed ranking first, and information about method design.
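As an illustration of the loss functions named in the method analysis above, the sketch below combines a soft Dice loss with a topK loss on a flattened list of voxel probabilities. It is not any team's implementation: topK loss is taken here to mean cross entropy averaged over the k% of voxels with the largest loss, and the value k = 10% and the unweighted sum of the two terms are assumptions.

```python
import math


def soft_dice_loss(probs, labels, eps=1e-6):
    """1 - soft Dice between foreground probabilities and binary labels."""
    intersection = sum(p * y for p, y in zip(probs, labels))
    denom = sum(probs) + sum(labels)
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)


def topk_cross_entropy(probs, labels, k_percent=10.0, eps=1e-7):
    """Mean binary cross entropy over the k% of voxels with the largest loss."""
    losses = []
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1.0 - p)))
    k = max(1, int(len(losses) * k_percent / 100.0))
    hardest = sorted(losses, reverse=True)[:k]
    return sum(hardest) / k


def dice_topk_loss(probs, labels, k_percent=10.0):
    """Unweighted sum of both terms (the weighting is an assumption)."""
    return soft_dice_loss(probs, labels) + topk_cross_entropy(probs, labels, k_percent)


# Toy, highly imbalanced example: 2 foreground voxels out of 20.
labels = [1, 1] + [0] * 18
good = [0.9, 0.8] + [0.05] * 18    # both aneurysm voxels found
missed = [0.9, 0.1] + [0.05] * 18  # one aneurysm voxel missed

print("good prediction:  ", round(dice_topk_loss(good, labels), 3))
print("missed foreground:", round(dice_topk_loss(missed, labels), 3))
```

Because the topK term averages only over the hardest voxels, a single missed foreground voxel dominates that term even though foreground makes up a small fraction of the volume.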

Segmentation Performance of true UIAs
To evaluate segmentation performance independently of detection, average segmentation metrics were determined for all teams over only the true UIAs that were detected, as displayed in Table 3. A ranking was made from these metrics in the same way as for task 2. This ranking changed the placing of some teams, as shown by the red brackets and arrows. However, the top 3 teams remained unchanged in position. The box plots of the segmentation metrics for each team over detected UIAs only are shown in Appendix B, Fig. 1.

Table 3
The mean segmentation metrics of each team evaluated only on the detected true UIAs. The arrows and brackets in red signify the difference between the original task 2 ranking ( Table 1b ), and the ranking based only on the detected UIAs. All values are quoted as means with 95% confidence intervals determined by bootstrapping in brackets.

Detection Performance on Negative Scans
The average false positive count over all scans containing no true UIAs was determined ( Appendix C , Table 1 ). This can be compared to the average false positive count for all scans with true UIAs. Teams IBBM, zelosmediacorp, junma and joker all have a zero false positive count for the scans containing no UIAs. All teams have a smaller false positive count per scan for the negative scans, compared to the positive scans containing true UIAs. IBBM and zelosmediacorp have a low false positive count for positive scans (0.02 and 0.06 respectively), but they also had a very low true positive count. Junma and joker have a substantially higher false positive count for positive scans (0.22 and 0.20 respectively).

Size of UIAs
The detection and segmentation performance improved with the size of the UIA. Fig. 4 shows the increase in sensitivity with increasing UIA diameter, with the UIA diameter divided into four quartiles. Sensitivity was represented as the mean sensitivity over all teams for each UIA. The error bar shows the 95% confidence interval of the mean. Appendix D, Fig. 1 shows how the sensitivity of each individual team varies with the size of the UIA. Fig. 5 a) and b) demonstrate that the segmentation performance also increased with UIA size. In 5a), the median DSC over all teams for each UIA was plotted against the individual UIA diameter. In 5b), the UIA diameter is again split into four quartiles and the mean DSC over all teams for each UIA was included. DSCs for individual teams are plotted in Appendix D, Fig. 2.
Table 4 shows the volume change measurements, the ground truth measurements and the predicted measurements for each team, and how well they agree according to Kendall's tau correlation measure. All measurements are taken only for true UIAs with a positive detection in both baseline and follow-up scans. This means that the ground truth volume also differs between teams, as it is taken as a mean over a different set of scans. The median ground truth difference over all baseline and follow-up scans was 2.9 μl. Team IBBM was not included, as fewer than 5 true UIAs were detected in both baseline and follow-up scans. Junma had the highest statistically significant agreement between ground truth and predicted volume change (Kendall's tau > 0.5, p < 0.05). Inteneural had a Kendall's tau < 0, which indicates some disagreement between ground truth and predicted volume change. Stronger and TUM_IBBM had values of Kendall's tau close to zero, suggesting no association between ground truth and predicted volume change for these methods.
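The volume change comparison above can be sketched as follows. This is a minimal illustration, assuming Kendall's tau-b (which handles ties) and volumes derived from voxel counts; the voxel size and voxel counts are hypothetical and no p-value is computed.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b: rank correlation from concordant and discordant pairs."""
    n = len(x)
    concordant = discordant = ties_x = ties_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: excluded from both terms
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = math.sqrt((concordant + discordant + ties_x) * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


def volume_change_ul(baseline_voxels, followup_voxels, voxel_volume_mm3):
    """Follow-up minus baseline volume in microlitres (1 mm^3 equals 1 microlitre)."""
    return (followup_voxels - baseline_voxels) * voxel_volume_mm3


# Hypothetical voxel size (mm) and (baseline, follow-up) voxel counts per UIA.
voxel_volume = 0.36 * 0.36 * 0.5
gt_pairs = [(120, 130), (300, 290), (80, 95), (210, 240), (150, 150)]
pred_pairs = [(110, 125), (310, 305), (70, 90), (200, 220), (160, 150)]

gt_change = [volume_change_ul(b, f, voxel_volume) for b, f in gt_pairs]
pred_change = [volume_change_ul(b, f, voxel_volume) for b, f in pred_pairs]
tau = kendall_tau_b(gt_change, pred_change)
print("Kendall's tau-b:", round(tau, 3))
```

A tau near 1 means the method orders the UIAs by growth in the same way as the ground truth, a tau near 0 means no association, and a negative tau means the ordering tends to be reversed.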

Intra-subject analyses
The performance of each method was compared between baseline and follow-up scans using the Wilcoxon rank test, the results of which can be seen in Appendix E, Fig. 1. For sensitivity, DSC and volumetric similarity, all methods had p ≥ 0.05, suggesting that performance did not differ between baseline and follow-up scans.

Table 4
Comparison of volume change measurements (median (IQR)) for ground truth and predicted segmentations with the correlation measure, Kendall's tau ( p value). Volume change was determined as the follow-up volume minus the baseline volume in μl. Note that the ground truth volume is different for each team, as it is evaluated only over true UIAs that were detected in both baseline and follow-up scans by the method.

Train vs test performance
All the submitted methods were also evaluated on the training data. The results can be seen in Appendix F, Tables 1a and 1b, which correspond to Tables 1a and 1b in the main text. As expected, the results on the training data are generally better than on the test data. For task 1, the overall ranking remains roughly similar, with some teams going up or down a few places. This could suggest that some methods generalise less well to the unseen test data, resulting in a lower performance on the test data compared with the training data. For task 2, the top-4 ranking methods remained in the same order as when assessed on the training data. All methods show a considerable drop in performance when assessed on the test set, relative to the training set. This suggests that the methods submitted for task 2 do not generalise well to the test data set.

Discussion
This paper presents the results and analysis of the Aneurysm Detection and segMentation Challenge held at the international conference of Medical Image Computing and Computer Assisted Intervention (MICCAI) in October 2020.
Two methods perform significantly better than all other methods for both tasks: (1) detection and (2) segmentation of UIAs on TOF-MRAs. Although the results are encouraging for automated UIA detection and segmentation methods, there is still room for substantial improvement. Compared to visual UIA detection from MRAs, the sensitivity of the submitted methods is, on average, lower than quoted in literature ( HaiFeng et al., 2017 ;White et al., 2000 ). The submitted segmentation methods also show a lower performance than the two observers in this study. Future developments will hopefully bring new and updated methods that are closer in performance to manual methods.

Top ranked methods
Mibaumgartner placed first in task 1 for detection of UIAs and did not participate in the second segmentation task. The method focuses on the detection task, by outputting bounding boxes from which a centre of mass was derived, as opposed to performing semantic segmentation. This is different to all other submitted methods. Mibaumgartner opts to still include semantic segmentation information by using Retina U-Net ( Jaeger et al., 2018 ), before classifying and regressing anchor boxes using a Path Aggregation Network ( Liu et al., 2018 ). Mibaumgartner did not discriminate between treated and untreated UIAs, using both as foreground voxels for training, which was different from other methods. This may have aided detection by giving more examples as some aneurysms treated with coils may look similar to untreated UIAs. As treated UIAs were masked on evaluation, this did not negatively affect the performance. Furthermore, mibaumgartner used both the structural MR images and the MRAs, which may have aided in the performance of the model by incorporating more information. Although mibaumgartner has the highest overall ranking, it does not achieve the highest sensitivity or lowest false positive count.
For task 1 and task 2, the methods of junma and joker showed comparable performance, both ranking above the other methods. Both use a 3D U-Net architecture based on the no new net (nnU-Net). The nnU-Net is an "out-of-the box tool for state-of-the-art segmentation": an open-source deep learning segmentation framework that automatically adapts to new datasets. In December 2019, the nnU-Net performed optimally or on par with the best methods in 19 different biomedical image analysis challenges, including the KiTS challenge ( https://kits19.grand-challenge.org/ ), the largest challenge at MICCAI 2019 ( Isensee et al., 2019 ). Joker made some small changes to the model, including using group normalisation instead of batch normalisation, although this did not appear to make much difference to its overall performance. Joker also used the structural images as input for training.

Method analyses
All top 3 methods for each task used an ensemble of trained models for prediction and in total 7/11 submitted methods used an ensemble. It is known that ensembles of deep learning models can aid in both image classification ( Krizhevsky et al., 2017 ) and segmentation tasks ( Kamnitsas et al., 2018 ;Kuijf et al., 2019 ). In general, ensemble methods were made up of models trained on different train/validation data splits or cross-validation. Winning team junma trained using five fold cross-validation and two different loss functions, before selecting the optimal five trained networks (based on DSC) to ensemble. Joker used an ensemble of four networks, which included networks trained for different classes in the scan (both treated and untreated UIAs) as well as including the structural MRI scans in two of the networks. The STAPLE analysis confirms that ensembles perform well, with an ensemble of all segmentations from all methods achieving the best ranking. STAPLE using an ensemble segmentation of the top three teams for task 2, junma, joker and kubiac, performs better than joker and kubiac individually but junma still remains the highest ranking.
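The two ways of combining model outputs mentioned in the method descriptions, majority voting on binary masks and averaging of per-voxel predictions, can be sketched on flattened masks as below. This is a minimal illustration rather than any team's code; the tie-breaking rule (a tie counts as background) and the 0.5 threshold are assumptions.

```python
def majority_vote(masks):
    """Label a voxel foreground when more than half of the binary masks agree."""
    n_models = len(masks)
    return [1 if 2 * sum(votes) > n_models else 0 for votes in zip(*masks)]


def average_probabilities(prob_maps, threshold=0.5):
    """Average per-voxel probabilities across models, then threshold."""
    n_models = len(prob_maps)
    return [1 if sum(p) / n_models >= threshold else 0 for p in zip(*prob_maps)]


# Hypothetical flattened outputs of four models on five voxels.
masks = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
    [0, 1, 0, 0, 1],
]
print(majority_vote(masks))

# Hypothetical probability maps of three models on three voxels.
prob_maps = [
    [0.9, 0.4, 0.1],
    [0.7, 0.6, 0.2],
    [0.8, 0.2, 0.3],
]
print(average_probabilities(prob_maps))
```

With an even number of models, as in a four-model ensemble, the handling of 2-2 ties has to be chosen explicitly; treating ties as background favours a lower false positive count over sensitivity.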
In addition to joker, the methods of mibaumgartner, kubiac and TUM_IBBM also use the structural images in their method suggesting that the networks may benefit from having the information contained in the structural images when detecting and segmenting UIAs. Other teams use the structural images to aid in patch selection for training.
The volume of a UIA is a very small percentage of the volume of a whole TOF-MRA, and in most MRAs only one UIA is present. As a result of this unbalanced problem, most methods chose to use ground truth knowledge for the patch selection, choosing a particular proportion of training patches to contain a UIA. Only two methods, inteneural and xlim, perform vessel segmentation on the TOF scans before performing UIA detection/segmentation. However, both methods are middle ranking (0.39, 0.41 respectively), suggesting that vessel segmentation may not help much in UIA detection or segmentation. Almost all task 2 segmentation methods used Dice loss in some form for training their networks. This is a calculated choice, as Dice is one of the metrics on which we evaluate the submitted solutions. Some methods use the generalised Dice loss ( Sudre et al., 2017 ), which has proven to be reliable for unbalanced problems, and others use Dice in combination with other loss functions such as cross-entropy, topK and boundary loss ( Kervadec et al., 2021 ). The winning method junma used an ensemble of models trained using Dice + cross entropy and Dice + topK loss. Kubiac and inteneural both included the boundary loss in their loss functions for training their models. By including boundary loss, the models are trained to minimise the distance between the predicted and ground truth segmentations. This reduces the problems associated with region-based metrics, such as Dice, for highly imbalanced data. Kubiac and inteneural have similar performance for task 2 (rankings 0.24 and 0.39 respectively) and this may be due to the similar architecture and loss function used.

Sensitivity of all teams for each UIA as a function of maximum UIA diameter in mm, when separating UIA diameter into four quartiles. Each point included in the box plot is the mean sensitivity of all teams across each UIA.

Many teams performed post-processing to accept only positive detections whose number of voxels fell within a range that was common in the training dataset. Further, some teams even limited the maximum number of positive detections reported, based on the probability, size or intensity of the predictions. This aided in the challenge ranking, as we explicitly evaluated on false positive count. This can be seen, for example, with xlim: a mean false positive count of 4.03, despite a sensitivity of 70%, meant that their ranking was lower than it might have been had they used a further false positive reduction method.
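The size-based post-processing described above can be sketched as a connected-component filter. This is a minimal illustration under stated assumptions: 6-connectivity, hypothetical size limits, and a cap on the number of detections that keeps the largest components. The teams' actual criteria (probability, size or intensity) are not reproduced here.

```python
from collections import deque


def connected_components(voxels):
    """Group foreground voxel coordinates (x, y, z) into 6-connected components."""
    remaining = set(voxels)
    neighbours = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    components = []
    while remaining:
        seed = remaining.pop()
        component, queue = [seed], deque([seed])
        while queue:
            x, y, z = queue.popleft()
            for dx, dy, dz in neighbours:
                nb = (x + dx, y + dy, z + dz)
                if nb in remaining:
                    remaining.remove(nb)
                    component.append(nb)
                    queue.append(nb)
        components.append(component)
    return components


def filter_detections(voxels, min_voxels, max_voxels, max_detections=None):
    """Keep components whose size lies in [min_voxels, max_voxels], largest first."""
    kept = [c for c in connected_components(voxels) if min_voxels <= len(c) <= max_voxels]
    kept.sort(key=len, reverse=True)
    return kept if max_detections is None else kept[:max_detections]


# Hypothetical prediction: a 2x2x2 blob, an isolated voxel and a 3-voxel line.
blob = [(x, y, z) for x in range(2) for y in range(2) for z in range(2)]
speck = [(10, 10, 10)]
line = [(20, 0, 0), (21, 0, 0), (22, 0, 0)]
prediction = blob + speck + line

kept = filter_detections(prediction, min_voxels=2, max_voxels=100)
print([len(c) for c in kept])
```

Raising the minimum size removes more spurious detections but also risks discarding small true UIAs, which are already the hardest cases for all methods.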

Segmentation Performance of true UIAs
The top three teams in task 2, junma, joker and kubiac, also ranked top for segmentation performance on true UIAs only. Junma's DSC of 0.64 is slightly higher than the interobserver DSC of 0.63. The MHD and VS are comparable to the interobserver MHD and VS, with all 95% confidence intervals overlapping. This suggests that the performance of the automatic segmentation method is on par with manual segmentation, once the true UIA has been identified. This method could be used in clinical research or routine practice, whereby a radiologist would only need to select a UIA from a small population of candidate UIAs, after which the correct UIA could be segmented automatically.

Detection performance on healthy scans
Top performing teams junma and joker also perform well on scans without true UIAs, with an average false positive count of 0 for such scans. This would be ideal in the clinic, as it avoids wrongly identifying UIAs and thereby giving radiologists the extra work of dismissing these false detections. Team IBBM and zelosmediacorp also had a false positive count of 0; however, their overall detection performance (sensitivity) across all scans, including those with true UIAs, was poor.

Size of UIAs
Overall, both detection and segmentation performance were better for larger UIAs across all methods, as both sensitivity and DSC increased with UIA diameter (Spearman's coefficient = 0.47 and 0.42 respectively). Not surprisingly, smaller UIAs are more difficult to detect, which is consistent with studies investigating visual detection of aneurysms. White et al. (2000) report an average sensitivity of 87% for detecting UIAs on MRAs by radiologists, with a sensitivity of 38% for UIAs < 3 mm and 94% for UIAs > 3 mm. From the results, it can be seen that the lower quartiles of diameter have a comparable sensitivity. Xlim has the highest sensitivity, with 71% for UIAs with diameter > 3.54 mm and < 4.98 mm and 95% for UIAs > 4.98 mm. As such, this method may be suitable for detection of larger UIAs with performance that is on par with human visual inspection. We assessed segmentation using DSC, which penalises small objects heavily and is limited by the voxel size of the images. For small UIAs, comprising few voxels, a given boundary error removes a larger fraction of the overlap, which results in a smaller DSC.
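The dependence of DSC on object size can be made concrete with a small worked example. The sketch below uses cubes as hypothetical stand-ins for aneurysms and displaces the prediction by exactly one voxel; it is an illustration of the metric's behaviour, not an analysis of the challenge data.

```python
def dice(a, b):
    """Dice similarity coefficient between two sets of voxel coordinates."""
    if not a and not b:
        return 1.0
    return 2.0 * len(a & b) / (len(a) + len(b))


def cube(side, offset=0):
    """Voxel coordinates of a side^3 cube, shifted by `offset` voxels along x."""
    return {(x + offset, y, z) for x in range(side) for y in range(side) for z in range(side)}


for side in (2, 4, 8, 16):
    truth = cube(side)
    shifted = cube(side, offset=1)  # same shape, displaced by one voxel
    print(f"side {side:2d} voxels: DSC = {dice(truth, shifted):.4f}")
```

For a cube of side s, a one-voxel shift leaves (s - 1)/s of the voxels overlapping, so the DSC is 0.5 for s = 2 but 0.9375 for s = 16: the same localisation error costs far more for a small object.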

Intra-subject Analyses
Comparing volume change between ground truth and predicted segmentations showed that performance varied between methods. Junma had the best agreement between ground truth and predicted volume changes (Kendall's tau > 0.5), suggesting that this method can accurately measure volumetric change and growth. Junma had the best segmentation performance overall, which could explain the volumetric change agreement. For some methods there was disagreement or almost no association between the predicted and ground truth volume changes, suggesting that these methods are not appropriate for measuring volumetric growth. It was also noted that the actual volumetric change was very small, and none of the aneurysms showed considerable growth between baseline and follow-up. The small volumetric change may explain the low volumetric change agreement of all methods. Based on the segmentation metrics and Wilcoxon rank test, the methods performed similarly for both baseline and follow-up scans. One variable that may have affected the intra-subject performance was the choice of train, test and validation splits within each method, as many methods did not take baseline-follow-up pairs into account.

Train vs test performance
Most methods, for both tasks, had a considerably lower performance on the test data than on the training data. This suggests that these methods did not generalise well to the unseen data. Reasons for this could be in the method design, the training/validation data splits, aneurysm sizes, or not taking into account the baseline-follow-up pairs. The distribution of aneurysm and scan characteristics is similar between the training and test sets, ensuring that the training data is representative of the test data. Nevertheless, some features such as aneurysm shape or the configuration with respect to the parent vessel were difficult to take into account, as they can vary considerably between patients. This reflects the true clinical nature of the data set, but ideally methods should be able to detect and segment UIAs, even on unseen examples.

Future work
Overall, further improvement is necessary to be comparable to manual clinical standards for UIA detection and segmentation. All methods performed worse for smaller UIAs and as small UIAs are often overlooked by radiologists, this would be a main aspect for improvement of the methods. Furthermore, with increased screening studies, detection of small UIAs would be beneficial to speed up workflow and to learn more about the prevalence of UIAs in the general population. The best detection method used a network specifically designed for detection as opposed to semantic segmentation. The other submitted methods appear limited for detection with most using a generic semantic segmentation method. This suggests that a "brute force" technique, by just applying a standard U-Net architecture, may not be optimal for this problem. Instead, future developments should think out of the box. It was also noted that few methods use information from the structural images to aid in their methods. Perhaps some prior knowledge of, for example, the location, shape and size of the UIA would aid in the method performance. The dataset was a true clinical dataset, with a mixture of scan parameters, and although this makes it technically challenging, a method that performs well over the whole test set would be very convenient to have for clinical use. For larger aneurysms, the top-ranked detection methods had a performance that was on par with human visual detection, suggesting that these methods could be used for the detection of larger UIAs.
The method of junma showed promising segmentation performance on the true UIAs. This suggests that a semi-automatic workflow allowing a radiologist to identify the location of the UIA and then using the model of junma as an accurate method of UIA segmentation may already be of use in current clinical practice. In future work, incorporating this segmentation method, with an improved detection method, may lead to an optimal automatic detection and segmentation method for UIAs.

Conclusions
The provided results were presented at the 23rd International Conference of MICCAI 2020. Methods for UIA detection and segmentation are encouraging but require further development before they can be used to detect, segment and quantify UIAs automatically with the same accuracy as a radiologist. However, detection methods may be suitable for use for larger aneurysms. Furthermore, segmentation performance of the top ranking method suggests it may be suitable for UIA segmentation after manual selection of the true UIA. The ADAM challenge remains open for submission of both new and improved methods.

Data availability
Training data and results are available at http://adam.isi.uu.nl/ . Scripts for evaluation of methods can be found at: https://github.com/hjkuijf/ADAMchallenge .
The test set is not publicly available, as it is kept secret for evaluation purposes of the submitted methods. The inference code submitted in Docker containers for the challenge is also available for most methods, whose teams gave permission, on DockerHub ( https://hub.docker.com/orgs/adamchallenge ).

Appendix A. Segmentation example
In the top row, one slice of the MRA is shown in which the ground truth and predicted segmentations are similar, as shown by the large overlap. In the bottom row, the predicted segmentation is much smaller than the ground truth and there is little overlap. The junma method segmented better in the centre of the aneurysm than at its edge.

Fig. A1. Segmentation of team junma on an example test case. Slices of the TOF-MRA of the same test case, showing how the segmentation from the junma method varied from the ground truth segmentation. Columns: a) no segmentation overlaid, b) both segmentations overlaid in yellow, c) ground truth segmentation overlaid in green, d) predicted segmentation overlaid in red.

Appendix F. Train vs Test Performance
Tables F1a and F1b .

Table F1a
Task 1: Detection metrics and ranking assessed for all methods on training data. FP is the average false positive count over all scans; sensitivity is the average sensitivity over scans containing true UIAs. Difference is the average value on the test set subtracted from the average value on the training set.