Investigation on performance of multiple AI-based auto-contouring systems in organs at risk (OAR) delineation

Manual contouring of organs at risk (OARs) is time-consuming and subject to inter-observer variability. AI-based auto-contouring has been proposed as a solution to these problems, provided it can produce clinically acceptable results. This study investigated the performance of multiple AI-based auto-contouring systems in segmenting different OARs. Auto-contouring was performed using seven different AI-based segmentation systems (Radiotherapy AI, Limbus AI versions 1.5 and 1.6, Therapanacea, MIM, Siemens AI-Rad Companion and RadFormation) on a total of 42 clinical cases with varying anatomical sites. Volumetric and surface Dice similarity coefficients and the maximum Hausdorff distance (HD) between the expert's contours and the automated contours were calculated to evaluate performance. Radiotherapy AI showed better performance than the other software in most tested structures in the head and neck and brain cases. No specific software showed overall superior performance over the others in lung, breast, pelvis and abdomen cases. Each tested AI system was able to produce contours of organs at risk comparable to the experts' contours, which could potentially be used clinically. A reduced performance of the AI systems on small and complex anatomical structures was found and reported, showing that it is still essential to review each contour produced by an AI system before clinical use. This study has also demonstrated a method of comparing contouring software options which could be replicated in clinics or used for ongoing quality assurance of purchased systems. Supplementary Information The online version contains supplementary material available at 10.1007/s13246-024-01434-9.


Introduction
To create a patient-specific radiotherapy plan, the radiation oncologists (ROs) manually contour the tumour or target region and organs at risk (OARs) on the patient's computed tomographic (CT) or magnetic resonance (MR) images. The accuracy of the contours is essential, as inaccurate contours have the potential to affect the outcome of the treatment. The manual contouring process is time-consuming, and the time taken can vary according to the professional's abilities and knowledge. It can take several hours to complete contouring for one patient [4]. Previous studies found that manual contouring can take up to 3 h in head and neck intensity-modulated radiotherapy (IMRT) planning [9].
These factors can also lead to noticeable delays in treatment, resulting in unwanted treatment outcomes [4]. A previous study found that increased waiting time for radiotherapy can increase the risk of local recurrence, which can translate into a decreased overall survival rate in some clinical situations [6].
Additionally, the contouring process suffers from large inter- and intra-observer contouring variability between professionals [9,12,17,20]. A considerable mean volume variation of about 50% was found during parotid delineation [9]. A study of inter-observer/institution variability in target and OAR contouring for breast radiotherapy planning found that the overlap between manually contoured structures was low (up to 10%) and that the variation between manually contoured volumes had standard deviations of up to 60% [17]. Inter-observer variations were also found in radiotherapy planning for other anatomical sites, such as cervical cancer radiotherapy [12] and oral cavity cancer radiotherapy [20]. Inter-observer variation has been shown to have a dosimetric impact during radiation therapy planning [17].
The auto-segmentation method has the potential to replace manual contouring. This auto-contouring technique was developed based on the capability of algorithms to use prior knowledge. In the early stages, auto-contouring techniques had no or minimal capability of using prior knowledge, due to limitations in computing power and the limited availability of prior segmentation data. These were low-level segmentation approaches such as intensity thresholding, region growing and heuristic edge detection [4]. As computing power rapidly developed, along with a much larger availability of prior knowledge, auto-contouring advanced quickly into techniques such as atlas-based auto-contouring and deep-learning auto-segmentation, which differ in the amount of prior knowledge they use.
Deep-learning auto-segmentation is a machine learning technique in which the algorithms learn, or are trained, to calculate the final contour. This technique uses multilayer neural networks called convolutional neural networks (CNNs) [4,31]. A large set of pre-contoured data, referred to as training data, is passed through the CNN to train the algorithm and optimise its parameters through the backpropagation algorithm, so that it can calculate an optimised contour for the target structures [16,31]. The type and performance of deep-learning based auto-segmentation depend on which network structure is used, such as U-Net [24], V-Net (a 3D version of U-Net) [4] or ResNet [14], and on the quality and quantity of the training data set [2,31]. More advanced network structures such as the vision transformer (ViT) have been introduced [28], and other studies have shown that ViTs perform better than CNNs when both networks are trained on larger datasets [11].
Many studies have compared the performance of in-house AI-based and atlas-based auto-contouring systems in OAR delineation accuracy for different cancer types, such as head and neck [5], breast [8] and liver [1]. Even though these studies demonstrated the better performance and efficiency of in-house AI-based auto-contouring over atlas-based auto-contouring, the development and implementation of an in-house AI-based system can be complex due to challenges such as the expertise required to develop and implement the programming code and limitations in collecting a large "training" set [26].

Quantitative evaluation method
The volumetric Dice Similarity Coefficient (DSC), surface Dice Similarity Coefficient (sDSC) and maximum Hausdorff Distance (HD) between the manual segmentation and each AI-based auto-contouring system's segmentation were calculated to quantitatively evaluate the performance of each AI-based auto-contouring software in OAR delineation [25]. The DSC, sDSC and HD were calculated using a Python script with PlatiPy version 0.4.0 [7]. The volumetric DSC measures the overlap between two contoured volumes and is defined as:

DSC = 2|A ∩ B| / (|A| + |B|)

where A is the volume of the manual contours and B is the volume of the contours delineated by an AI system. The DSC ranges from 0, indicating no overlap between the two contours, to 1, indicating complete overlap.
The surface Dice Similarity Coefficient (sDSC) is a newer metric for assessing segmentation performance, introduced by Nikolov et al. [21]. It measures the overlap between two surfaces at a defined tolerance (τ) and is defined as:

sDSC(τ) = (|S_A ∩ B_B^(τ)| + |S_B ∩ B_A^(τ)|) / (|S_A| + |S_B|)

where S_A and S_B are the surfaces of the manual contours (A) and the AI contours (B), and B_A^(τ) and B_B^(τ) are the border regions of the manual contours and AI contours at tolerance τ, respectively. Because OARs in radiotherapy are contoured slice by slice and segmentation performance is judged by the fraction of the contour surface that needs to be edited, the sDSC has been suggested as a more suitable metric than the volumetric DSC: the volumetric DSC weighs all regions where the two volumes do not overlap equally, independently of their distance from the surface, and is biased towards OARs with large volumes [21]. Another study showed that the sDSC is a better indicator than the DSC and HD of the time needed to edit contours and the time saved by using auto-contouring systems [27]. The tolerance parameter should be set to a level of variation that is clinically acceptable, for example by measuring inter-observer contouring variation [21]. For this study, a τ value of 0 mm was used for the sDSC calculation to evaluate the absolute difference between the manual and each AI system's contours. The maximum Hausdorff Distance (HD) measures the difference in shape between two contours and is defined as:

HD(A, B) = max(h(A, B), h(B, A)), where h(A, B) = max_{a∈A} min_{b∈B} ||a − b||

and ||a − b|| is the Euclidean distance between point a in A and point b in B. A zero HD value indicates no difference between the two contours' shapes; as the HD value increases, the difference between the two shapes increases.
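The three metrics above can be illustrated with a minimal NumPy/SciPy sketch operating on voxelised binary masks. This is not the PlatiPy implementation used in the study; the function names and the voxel-based approximations (surface voxels via morphological erosion, surface-to-surface distances via a distance transform) are illustrative only.

```python
import numpy as np
from scipy import ndimage

def dice(a, b):
    """Volumetric DSC between two boolean masks: 2|A ∩ B| / (|A| + |B|)."""
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum())

def surface(mask):
    """Approximate surface voxels: the mask minus its binary erosion."""
    return np.logical_and(mask, ~ndimage.binary_erosion(mask))

def surface_dice_tau0(a, b):
    """Voxelised sDSC at tolerance τ = 0 mm (border region = the surface itself)."""
    sa, sb = surface(a), surface(b)
    overlap = np.logical_and(sa, sb).sum()
    return 2.0 * overlap / (sa.sum() + sb.sum())

def hausdorff(a, b, spacing=(1.0, 1.0, 1.0)):
    """Maximum symmetric Hausdorff distance between the two mask surfaces."""
    # Distance from every voxel to the nearest surface voxel of each mask
    da = ndimage.distance_transform_edt(~surface(a), sampling=spacing)
    db = ndimage.distance_transform_edt(~surface(b), sampling=spacing)
    h_ab = db[surface(a)].max()  # h(A, B): farthest A-surface voxel from B's surface
    h_ba = da[surface(b)].max()  # h(B, A)
    return max(h_ab, h_ba)
```

For two identical masks, dice and surface_dice_tau0 return 1 and hausdorff returns 0, matching the metric definitions above.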
To ensure a valid comparison, cases with non-identical numbers of data sets were divided into separate groups, ensuring that each set had an equal number of data points when calculating the mean DSC, sDSC and HD. For instance, 19 cases were selected for testing spinal cord segmentation; however, data from RTAI was unavailable for 9 of the 19 cases, as the RTAI model was designed exclusively for head and neck cases at the time of the study.
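This grouping step can be sketched with a small pandas example. The table below is hypothetical (the case IDs, system columns and DSC values are invented for illustration); NaN marks a case for which a system produced no contour.

```python
import numpy as np
import pandas as pd

# Hypothetical per-case DSC table: rows are cases, columns are AI systems.
# NaN marks a case the system could not contour (e.g. no model available).
dsc = pd.DataFrame(
    {
        "RTAI":   [0.91, 0.88, np.nan, 0.90],
        "Lim1.6": [0.89, 0.87, 0.85, 0.88],
        "RF":     [0.90, 0.86, 0.84, 0.89],
    },
    index=["HN01", "HN02", "SC03", "HN04"],
)

# Restrict the comparison to cases where every system produced a contour,
# so each mean is computed over the same set of data points.
complete = dsc.dropna()
means = complete.mean()
```

Dropping incomplete rows before averaging ensures that the mean DSC, sDSC and HD for each system are computed over an identical set of cases.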

Statistical analysis
The statistical difference between each index of DSC and HD for each tested AI-based software was calculated using one of three statistical tests, (1) Student's t-test, (2) Welch's t-test and (3) the Wilcoxon signed-rank test, chosen according to the properties of the compared data sets, with a p-value less than 0.05 indicating significance [26]. The tests were automated using an in-house Python script combined with published Python packages. Box plots of each data set in each case were created to check for outliers, and histograms were created to visually inspect the distribution of the data. The Shapiro-Wilk test and Q-Q plots were used to test the normality of the distribution of each sample. When the data were assumed to be normally distributed, the F-test was used to determine whether the variances of the compared data sets were equal. Student's t-test was used when the two compared data sets had equal variances, and Welch's t-test was used when their variances were unequal. The Wilcoxon signed-rank test was used when both compared data sets were not normally distributed, and when a normally distributed data set was compared with one that was not. It was also used to compare two data sets where either or both contained outlier data points [13]. The detailed results of the statistical tests conducted during the study can be found in supplementary data A (DSC), B (HD) and C (sDSC).
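The test-selection logic described above can be sketched with scipy.stats. The in-house script is not published, so this is only an assumed structure: the function name and the decision thresholds are illustrative, and the visual checks (box plots, histograms, Q-Q plots) are omitted.

```python
import numpy as np
from scipy import stats

def compare(x, y, alpha=0.05):
    """Pick and run one of the three tests described above (illustrative sketch).

    Normality is checked with Shapiro-Wilk; when both samples look normal,
    an F-test decides between Student's and Welch's t-test, otherwise the
    Wilcoxon signed-rank test is used.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    normal_x = stats.shapiro(x).pvalue > alpha
    normal_y = stats.shapiro(y).pvalue > alpha
    if normal_x and normal_y:
        # Two-sided F-test for equality of variances
        f_stat = np.var(x, ddof=1) / np.var(y, ddof=1)
        dfx, dfy = len(x) - 1, len(y) - 1
        p_f = 2 * min(stats.f.sf(f_stat, dfx, dfy), stats.f.cdf(f_stat, dfx, dfy))
        if p_f > alpha:
            name, p = "Student's t-test", stats.ttest_ind(x, y, equal_var=True).pvalue
        else:
            name, p = "Welch's t-test", stats.ttest_ind(x, y, equal_var=False).pvalue
    else:
        name, p = "Wilcoxon signed-rank", stats.wilcoxon(x, y).pvalue
    return name, p, p < alpha
```

Given two arrays of per-case DSC values for two systems, compare returns the test that was applied, its p-value and whether the difference is significant at the 0.05 level.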

Results
The performance of each individual AI-based auto-contouring system in contouring twenty-three different organs at risk considered in various clinical cases (head and neck, brain, lung, breast, pelvis and abdomen) was quantitatively evaluated by calculating the DSC, HD and sDSC between the contours of each tested organ contoured manually by an expert (Manual) and automatically by each software system: Radiotherapy AI (RTAI), Limbus AI version 1.5 (Lim1.5) and version 1.6 (Lim1.6), Therapanacea (TH), MIM (MIM), Siemens AIRC (SAIRC) and RadFormation (RF). Higher DSC and sDSC values and lower HD values indicate better agreement with the Manual contours. The mean, standard deviation, range and maximum absolute difference of the DSC for each considered OAR in head and neck and brain cases are illustrated in Table 2; the corresponding values for lung, breast, pelvis and abdomen cases are presented in Table 3. The same statistics for the maximum HD are illustrated in Table 4 (head and neck and brain) and Table 5 (lung, breast, pelvis and abdomen), and for the surface DSC in Table 6 (head and neck and brain) and Table 7 (lung, breast, pelvis and abdomen).

Discussion
In this study, seven different AI-based auto-contouring systems were tested to study each system's performance in contouring the organs at risk considered in different clinical cases. In general, the study showed that sDSC values were considerably smaller than volumetric DSC values, especially for OARs with large volumes, as reported in previous studies [10,21,27].
In head and neck and brain cases, the contours delineated by each AI system showed good agreement with the reference contours for most of the OARs considered. The DSC values for the brain, brainstem, left eye, right eye, left parotid gland, right parotid gland, left submandibular gland and right submandibular gland from the tested AI systems were comparable to the previous studies by Doolan et al. [10] and Liu et al. [19]. This study reported slightly lower sDSC values for the same structures from the tested AI systems [10], and the HD values for the same set of OARs were slightly higher than previously reported [10].
The study found that the AI systems showed reduced and inconsistent performance in contouring small and complex structures, such as the optic structures and the oesophagus, which are difficult to visualise in CT images compared with MR. The reduced and inconsistent performance of auto-contouring systems on small and complex structures has been reported previously: the study by Liu et al. [19] reported a low DSC value for the optic chiasm and wide variation in the DSC values for the left and right optic nerves across multiple previous studies. Similarly, reduced and inconsistent performance was found in this study for the oesophagus, which correlates with previously reported DSC, sDSC and HD values for the oesophagus [10].

The Radiotherapy AI software showed the best performance across all tested systems. The better agreement between the Radiotherapy AI contours and the manual contours in this study may be due to the fact that the Radiotherapy AI model was trained on our clinic's contours and therefore produced contours similar to those used in our clinic. This result demonstrates the advantage of an in-house built AI system, or of an AI system trained on clinic-specific data: it provides contours more similar to those currently used in that clinic. On the other hand, this could perpetuate incorrect contouring, does not provide a review of current contouring practice, and would not lead to standardisation of contours across radiation therapy centres. However, the study found very small maximum differences in both DSC and HD values across all tested systems, so in most test cases the shapes of the contours delineated by the AI systems were comparable to each other.
A low DSC for the spinal cord was found across all tested AI systems during this study, whereas previously reported DSC values for the spinal cord were considerably higher [10,19]. This large disagreement occurs because the manual contours cover only the part of the spinal cord which lies in the treatment field, while the AI systems contour the entire spinal cord visible in the image, as shown in Fig. 1.
No specific AI-based software showed overall superior performance compared to the others in lung, breast, pelvis and abdomen cases. Again, the very small maximum differences in both DSC and HD values across all tested systems support the conclusion that the shapes of the contours delineated by each AI system are comparable to each other.
The DSC values for the bladder, left and right lungs, heart, left and right kidneys, liver, rectum and stomach from the tested AI systems were comparable to previous studies [1,10]. This study reported slightly lower sDSC values for the bladder, heart, left and right lungs and liver from the tested AI systems compared to previously reported sDSC values [10], and the HD values for the same set of OARs were slightly higher than previously reported [10]. This study also reported slightly lower performance for the rectum compared to previously reported DSC, sDSC and HD values [10].
For both the left and right femoral heads, the DSC and sDSC were comparable to, and the HD slightly higher than, previously reported DSC, sDSC and HD values [10]. The study found that the DSC values of RadFormation were lower and its HD values higher than those of the other tested AI systems for both femoral heads. These low DSC values, high HD values and the large variation in the average DSC value compared with the other AI software were due to RadFormation's different contouring method, which delineated the femoral head only, while the other systems and the manual reference contours included a small portion of the femoral neck, as shown in Fig. 2.

There were several limitations in this study. Firstly, there were limitations in a few of the tested AI systems' models. The Radiotherapy AI model was only available for head and neck and brain regions, while the MIM model only contoured structures in male pelvis cases at the time of the study. Shortly after the analysis for this study was performed, most AI systems updated their models to improve their contouring quality and offered additional structures to be contoured. Due to the rapid development of the field, it was not feasible to keep the reported performance of all tested AI systems up to date, so it must be noted that this study only reflects the specific version of each tested system, as stated in the method section. This implies that clinics, whether planning to implement or having already integrated an AI system, require a set of workflows or a tool to assess the AI system's performance; this will be crucial for keeping pace with the rapid advancements in this field. Secondly, the sample size may have been insufficient to provide adequate power for the statistical tests [30]. The sample size for some OARs was very small, with only four or five reference contours for the right submandibular gland and the stomach, so the statistical tests performed for data sets with fewer than five samples were ignored and denoted as ***** in supplementary data A, B and C.
Thirdly, in a few cases, some software systems were not able to produce particular contours for every patient. For instance, Radiotherapy AI produced an incomplete contour of the left optic nerve by contouring only a single CT image slice in case HN10. Fourthly, the manual contours considered as the reference during this study were contoured by only a single expert; using cross-validated contours would have better ensured the accuracy of the reference data. Lastly, Baroudi et al. [3] discussed that, for automated contours to be clinically accepted, AI systems need to be evaluated in multiple domains: quantitative evaluation of automated contours using geometric metrics, qualitative evaluation by the end users using Likert scales and Turing tests, dosimetric evaluation by assessing the impact on the dose to OARs and targets when automated contours are used in planning, and assessment of the improvement in the efficiency of the clinical workflow when the AI system is used. This study exclusively conducted a quantitative evaluation of automated contours; as one of the main intentions of this study was to provide a starting point or guidance to other clinics that are considering implementing such systems into their clinical workflow, additional forms of evaluation are planned as future work.

Fig. 1
Fig. 1 Manual and AI systems' contours of the spinal cord in the Varian Eclipse Treatment Planning system

Fig. 2
Fig. 2 3D representation of manual and AI systems' contours of both left and right femoral heads in the Varian Eclipse Treatment Planning system

Table 4 Maximum Hausdorff Distance (HD) values (mean, STD and range, in mm) between manual contours and individual automated contours of the OARs considered in head and neck and brain cases. Bold underlined values indicate the lowest HD values. *No available model from AI system. **Result from only one tested case. ***No contours produced by AI system. ****Corresponding organ was not included in the AI system's contouring template.

Table 6 Surface DSC values between manual contours and individual automated contours of the OARs considered in head and neck and brain cases. Bold underlined values indicate the highest sDSC values. Footnotes as for Table 4.

Table 7 Surface DSC values between manual contours and individual automated contours of the OARs considered in lung, breast, pelvis and abdomen cases. Footnotes as for Table 4.

Table 1
CT parameters used for each tested case. The authors were unable to identify the network used for Therapanacea ART-plan Annotate. Radiotherapy AI used clinical data from Chris O'Brien Lifehouse as the training data set for its model; the training data set and the data set used for this study were mutually exclusive. Radiotherapy AI is in the development stage and is not yet commercially available.
The highest mean DSC and sDSC values and the lowest HD value for each case are presented in bold and highlighted. The distribution of individual data for each OAR is tabulated and illustrated in both scatter and box plots; the corresponding statistical results are presented in Supplementary data A (DSC), B (HD) and C (sDSC). Box plots of the data for the individual AI systems for all considered OARs are shown in Supplementary data 1 (DSC), 2 (HD) and 3 to 6 (sDSC with different τ values).

Table 5
Maximum Hausdorff Distance (HD) values between manual contours and individual automated contours of OARs considered