Deep-learning magnetic resonance imaging-based automatic segmentation for organs-at-risk in the brain: Accuracy and impact on dose distribution

Highlights
• Deep-learning-based automatic segmentation of brain organs-at-risk was fast and accurate.
• Automatic contouring reduced inter-observer variability.
• Dice similarity coefficients correlated with size, with values > 0.8 for larger structures.
• Geometric variations had minimal impact on dose distribution.


Introduction
Radiotherapy is part of the multidisciplinary management of both primary and metastatic brain tumours, frequently in combination with systemic therapy and surgery. Each of these treatment modalities carries an individual risk of adverse effects, which increases when they are combined [1][2][3][4][5]. Radiation-induced acute toxicities can be dose-limiting or life-threatening, and late effects such as hearing or visual impairment, endocrine dysfunction, and neuro-cognitive decline can be detrimental for long-term survivors [6][7][8].
Delineation of organs-at-risk (OARs) is essential for radiotherapy planning, enabling healthy tissue sparing and toxicity risk estimation. With the availability of high-quality imaging such as magnetic resonance imaging (MRI), an increasing number of brain structures can be accurately delineated; including them in the plan optimization process can help to minimize the risk of adverse events, increase the therapeutic ratio, and improve the quality of survival. However, manual contouring is often challenging in terms of time burden and accuracy; considerable variability still exists between observers [9][10][11] despite the availability of anatomical guidelines and atlases [12,13]. Variations in contours can impact dosimetry and plan quality, with a subsequent potential impact on treatment outcomes [14][15][16]. Several automatic and semi-automatic tools for target and OAR delineation are currently available or under investigation, including traditional atlas-based or machine-learning approaches and newer artificial intelligence (AI) algorithms based on neural networks and deep learning. These methods could potentially optimize workflows, shorten planning times, and harmonize practice by reducing inter- and intra-observer variability in delineation [17,18]. Some AI models have shown good results in terms of contouring accuracy, with several programs already commercially available [14,19,20]. However, it is not yet established whether faster delineation translates into a shorter overall planning time, or whether the existing geometric discrepancies in contouring have any impact on the dose distribution when compared to the current human-based standard.
In this study, we evaluated the feasibility and performance of the T1-MRI model of a commercially available, guideline-based, deep-learning automatic segmentation software for organ-at-risk delineation in the brain. We addressed the workflow, time burden, and geometric accuracy of the AI-assisted method compared to the clinician-based process, and investigated the potential impact of the geometric differences on dose distribution, in order to clinically validate this method on an independent data set.

Material and methods
This retrospective study was reviewed and approved by the local ethics committee. All patients had previously consented to the use of their data for research purposes, and all patient data were anonymized prior to this investigation.

Patients, delineation and geometric comparison
Thirty adult (>18 years) patients with primary brain tumours who had previously completed radiotherapy in a tertiary cancer center were randomly selected. A set of 22 OARs per patient was manually contoured by a radiation oncologist specialized in brain radiotherapy, based on the EPTN (European Particle Therapy Network) recommendations [12,13], using the planning computed tomography (CT) rigidly registered with a gadolinium-enhanced T1-weighted MRI scan. The patients had tumours in various locations: occipital (n = 8), parietal (n = 8), temporal (n = 7), midline (n = 4), frontal (n = 2) and the posterior fossa (n = 1). Of the hemispheric tumours, 16 were on the left and nine on the right side. Four scans were acquired with 3 Tesla MRI scanners and 26 with 1.5 Tesla machines. Slice thickness was one millimeter or less for 17 MRI scans and 1.5-5 mm for the other 13. Slice thickness correlated with the total number of slices, and the latter was used as a surrogate in the analysis. Manual delineations were expert-approved and considered as the reference/ground-truth structures. For each case, a second segmentation was then performed using a commercial, deep-learning, guideline-based AI software (MVision GBS™, Version 1.2.2). The T1-MRI Brain model v1.2.2 consists of a convolutional neural network with an encoder-decoder architecture, which was trained using an Adam optimizer on 90 cerebral 3D-iso T1-MRI scans (1.5 Tesla), following the EPTN contouring guidelines [21]. The AI structures were then uploaded into the local planning software and transferred to the rigidly registered planning CT. Subsequently, the AI structures were manually edited until they were considered clinically acceptable, thus obtaining a third structure set (AIedit). Delineation time for each step was documented.
To capture inter-observer variability, five cases were randomly selected for a contouring exercise in which two radiation oncology trainees (RO A and B) manually delineated the 22 OARs, yielding five alternative structure sets for these cases (RO_A, RO_B, AI, AIedit, and Reference). For geometric comparison, Velocity™ software (Varian Medical Systems, Palo Alto, CA, USA) was used to compute the Dice Similarity Coefficient (DSC) [22] and surface distances (mean = MSD, median, maximum = Hausdorff Distance, HD). The "union" function was used to merge all structures except the Brain (which includes all organs but the Brainstem and its subdivisions), obtaining one single structure encompassing 21 OARs (Union_Ref, Union_AI, Union_AIedit, Union_RO-A, Union_RO-B).
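To make the geometric metrics concrete, the following is a minimal sketch of how DSC and symmetric surface distances could be computed from two binary masks. It is not the Velocity implementation, whose exact definitions are not described here; the function names, the face-neighbour surface definition, and the pooling of distances in both directions are assumptions made for illustration, and the brute-force distance matrix is only practical for small structures.

```python
import numpy as np

def dice(a, b):
    """Dice similarity coefficient: 2|A ∩ B| / (|A| + |B|)."""
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    total = a.sum() + b.sum()
    if total == 0:
        return float("nan")  # both masks empty: DSC is undefined
    return 2.0 * np.logical_and(a, b).sum() / total

def surface_points(mask, spacing):
    """Coordinates (mm) of mask voxels with at least one background face-neighbour."""
    mask = np.asarray(mask, bool)
    padded = np.pad(mask, 1)  # zero padding so border voxels count as surface
    inner = (slice(1, -1),) * mask.ndim
    interior = np.ones_like(mask)
    for axis in range(mask.ndim):
        for shift in (-1, 1):
            interior &= np.roll(padded, shift, axis)[inner]
    return np.argwhere(mask & ~interior) * np.asarray(spacing, float)

def surface_distances(a, b, spacing=(1.0, 1.0, 1.0)):
    """Mean, median and maximum (Hausdorff) of the pooled A->B and B->A distances."""
    pa, pb = surface_points(a, spacing), surface_points(b, spacing)
    d = np.linalg.norm(pa[:, None, :] - pb[None, :, :], axis=-1)
    both = np.concatenate([d.min(axis=1), d.min(axis=0)])
    return {"mean": both.mean(), "median": np.median(both), "hausdorff": both.max()}

# Toy example: a 4x4x4 cube and the same cube shifted by one voxel.
ref = np.zeros((10, 10, 10), bool)
ref[2:6, 2:6, 2:6] = True
ai = np.zeros((10, 10, 10), bool)
ai[3:7, 2:6, 2:6] = True
print(dice(ref, ai), surface_distances(ref, ai))
```

In this toy case the overlap is 48 of 64 voxels per cube, so the DSC is 0.75, and the largest surface-to-surface distance equals the one-voxel shift.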

Planning and plan comparison
Volumetric-modulated arc therapy plans were generated for 15 selected representative cases using the same plan objectives and optimization criteria as the original plan, defined within the institutional protocol, which is based upon the EPTN consensus recommendations [23]. A dose of 60 Gy (standard fractionation) was prescribed to the Planning Target Volume (PTV) [24]. Each plan was re-optimized based on the three structure sets (Plan_Ref, Plan_AI, Plan_AIedit), thus obtaining 45 different dose distributions. The PTV and any originally contoured OARs not included in the 22 test structures (Cochlea_L/R, Eye_L/R, Lens_L/R, Retina_L/R) were copied from the original plans (previously used for patient treatment). Both the manual delineations and treatment planning were performed using Eclipse 15.6 (Varian Medical Systems, Palo Alto, USA). Dose-Volume-Histogram (DVH) comparison and Dmean/Dmax collection for each plan were performed based on the reference structures, as they were considered to best reflect the real anatomy, using the DICOM registration and plan comparison functions in Velocity. Dose differences were defined as absolute values: ΔDmean_AI/AIedit = |Dmean_Ref − Dmean_AI/AIedit| and ΔDmax_AI/AIedit = |Dmax_Ref − Dmax_AI/AIedit|. SNC Patient 8.2.0 (Sun Nuclear Corporation, Florida, USA) was used for relative gamma analysis with 3%/3 mm pass criteria and a 10% dose cut-off.
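As an illustration of what a 3%/3 mm gamma criterion with a 10% dose cut-off measures, the sketch below computes a gamma pass rate for two one-dimensional dose profiles. It is a simplified stand-in, not the SNC Patient algorithm: the restriction to 1D, the global normalisation to the maximum reference dose, the absence of interpolation between samples, and the function name are all assumptions made for this example.

```python
import numpy as np

def gamma_pass_rate_1d(ref, ev, spacing_mm, dose_tol=0.03, dist_tol_mm=3.0, cutoff=0.10):
    """Percentage of reference points with gamma <= 1 (simplified 1D, global gamma).

    ref, ev    : dose profiles sampled on the same grid
    spacing_mm : distance between neighbouring samples
    dose_tol   : dose criterion as a fraction of the maximum reference dose (assumed)
    cutoff     : reference points below cutoff * max dose are ignored
    """
    ref, ev = np.asarray(ref, float), np.asarray(ev, float)
    x = np.arange(ref.size) * spacing_mm
    norm = ref.max()  # global normalisation (assumption)
    # Rows index reference points, columns index evaluated points.
    dose_term = (ev[None, :] - ref[:, None]) / (dose_tol * norm)
    dist_term = (x[None, :] - x[:, None]) / dist_tol_mm
    gamma = np.sqrt(dose_term**2 + dist_term**2).min(axis=1)
    keep = ref >= cutoff * norm
    return 100.0 * np.mean(gamma[keep] <= 1.0)

# A field edge displaced by 2 mm still passes everywhere under 3%/3 mm.
reference = np.array([0.0] * 10 + [100.0] * 10)
shifted = np.array([0.0] * 12 + [100.0] * 8)
print(gamma_pass_rate_1d(reference, shifted, spacing_mm=1.0))
```

The example shows why a gamma pass rate can remain high despite local contour-driven shifts: a point passes if a matching dose is found within the distance tolerance, even when the point-wise dose difference is large.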

Statistical analysis
The Wilcoxon signed-rank test and Bland-Altman plots were used for paired, non-normally distributed data, correlations were assessed with the Spearman correlation coefficient (ρ), and volume thresholds were calculated with receiver operating characteristic (ROC) curves, using the R software (R Foundation for Statistical Computing, Vienna, Austria). A detailed description of the workflow is shown in Supplementary material - Figure S1.
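For readers less familiar with these measures, the sketch below shows how the Bland-Altman bias and 95% limits of agreement, and the Spearman coefficient, can be obtained for paired measurements. The analysis in this study was performed in R; this Python version is only an illustrative equivalent, and the function names and example values are hypothetical.

```python
import numpy as np

def bland_altman(x, y):
    """Return (bias, lower limit, upper limit) for paired measurements x and y.

    bias   = mean of the paired differences
    limits = bias +/- 1.96 * sample standard deviation of the differences
    """
    diff = np.asarray(x, float) - np.asarray(y, float)
    bias = diff.mean()
    sd = diff.std(ddof=1)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

def average_ranks(v):
    """1-based ranks; tied values share the mean of their ranks."""
    v = np.asarray(v, float)
    order = np.argsort(v, kind="mergesort")
    ranks = np.empty(v.size)
    ranks[order] = np.arange(1, v.size + 1)
    for value in np.unique(v):
        tie = v == value
        ranks[tie] = ranks[tie].mean()
    return ranks

def spearman_rho(x, y):
    """Spearman coefficient = Pearson correlation of the rank-transformed data."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])

# Hypothetical paired Dmean values (Gy) for one OAR, for illustration only.
dmean_ref = [12.1, 30.4, 8.7, 45.2, 22.0]
dmean_ai = [12.4, 30.1, 9.0, 45.6, 21.8]
print(bland_altman(dmean_ref, dmean_ai))
print(spearman_rho(dmean_ref, dmean_ai))
```

A bias close to zero with narrow limits of agreement indicates that the two sets of measurements can be used interchangeably for practical purposes.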

Results
The automatic tool failed to generate 27 of the 660 expected OARs (4%), three of which were missing because of tumour invasion, giving an overall success rate of 96% for delineated structures. Eleven cases had at least one missing structure, with an average of 0.9 missing structures per case. The OpticChiasm was missing in six patients, followed by the CorpusCallosum (n = 4), OpticTract_L/R (n = 2), and Pituitary (n = 2). Other missing structures were OpticNrv_L/R and GlndLacrimal_L/R (n = 2/1 and 1/2). Fig. 1A, B shows an example of a representative case.
The differences between median values for Dmean_AIedit/Dmean_Ref and Dmax_AIedit/Dmax_Ref were smaller than 0.5 Gy. Individually, statistical significance (p < 0.05) was only reached for OpticChiasm and OpticNrv_L, with higher Dmean for AIedit. Dmax_AIedit was higher for Hypothalamus and OpticNrv_L and lower for GlndLacrimal_R and Pons (p < 0.05).
No statistically significant difference was found between Dmean/max_AI and Dmean/max_AIedit.

Discussion
In this study, we evaluated an automatic segmentation tool for brain OARs in a clinical setting, with a focus on workflow, time, and accuracy, and on reducing inter-observer variability. Furthermore, we tested the feasibility of using the AI-generated contours for planning, with or without subsequent human intervention and manual adjustment. The clinical validation of this deep learning-based AI tool was performed on a data set independent of the one used for training the model. The automatic segmentation took significantly less time than manual contouring. However, additional time was needed for other manually performed steps: exporting and importing structures in the local planning system, transferring structures to the planning CT, and potential corrections.
Fig. 1. A, B: Example case with reference structures in darker and AI-generated structures in lighter shades. C: Axial MRI showing inter-observer variability between clinicians (yellow and green), reference structures (cyan) and AI contours (red).
One of the strengths of this study is the large number of OARs (22) included and the use of MRI for delineation, which is essential in brain radiotherapy. This is especially valuable in the era of intensity modulation and inverse planning, where including OARs in the optimisation process is essential to improve OAR sparing. Other authors have mostly reported on a smaller number of brain OARs, with a focus on the optic nerves, chiasm, brainstem, and hippocampi (Supplementary material - Figure S5) [15,17,19,25-30]. There are limited data on MRI-based auto-segmentation in the brain, as most reports are CT-oriented [16,18,31-37]. Nonetheless, some OARs of high importance for radiotherapy planning (Cochlea, Eyes, Retina, and Lenses) are not included in this model and would require manual delineation. These, together with GlndLacrimal_L/R and OpticNrv_L/R, are generally better visualised on CT, which might explain the lower DSC when contoured on the MRI scan, with several reports showing better results for CT-based segmentation [33,37,38].
Our results indicate that auto-segmentation can reduce inter-observer variability, with similar or higher DSC reported for AI-generated contours for most OARs when compared to RO_A and RO_B. The overall DSC was very good for the AI-generated structures and increased after manual editing, leading to smaller differences in dose distribution but adding a considerable amount of time. Poorer results were found for small structures, especially the optic apparatus, with DSC values around 0.45-0.55; however, these structures are usually rapidly delineated by clinicians. Nonetheless, DSC has limitations as a measure of contour similarity: it is highly sensitive to discrepancies in small structures and becomes less sensitive for larger ones. The other metrics used, such as the surface distances, were highest for the large structures (Brain, Brainstem, and its subdivisions), mostly because of discrepancies in the superior and inferior borders of these structures. Alternative geometric parameters, such as the surface Dice coefficient and the Jaccard index [39], may provide additional information on the accuracy of the auto-segmentation tool, but were not used in this study. The exclusion of these metrics could be considered a limitation, along with the subjective nature of assigning a ground truth, as no delineation, despite expert approval, can perfectly replicate real anatomy, and each is performed on a single snapshot in time that does not always represent the anatomy. It should be noted that distortions of the contours might have occurred during the MRI-to-CT transfer, likely due to imperfect registration, variations in CT/MRI slice thickness, and different scanning angles. The correlation between the DSC and the number of MRI slices (which was used as a surrogate for slice thickness) indicates that a good-quality MRI scan is needed for optimal results.
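The size dependence of DSC noted above can be illustrated with a toy example: the same one-voxel displacement is applied to a small and a large cubic structure. The cube sizes, grid size, and function names are arbitrary choices for this sketch and do not correspond to any OAR in the study; the Jaccard conversion uses the standard identity J = DSC / (2 − DSC).

```python
import numpy as np

def dice(a, b):
    """Dice similarity coefficient of two non-empty boolean masks."""
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

def cube(size, offset=0, grid=64):
    """Cubic mask of the given side length, optionally shifted along the first axis."""
    mask = np.zeros((grid,) * 3, bool)
    start = 10 + offset
    mask[start:start + size, 10:10 + size, 10:10 + size] = True
    return mask

def dsc_after_one_voxel_shift(size):
    """DSC between a cube and a copy displaced by one voxel; equals (size - 1) / size."""
    return dice(cube(size), cube(size, offset=1))

def jaccard_from_dice(dsc):
    """Jaccard index expressed through the Dice coefficient."""
    return dsc / (2.0 - dsc)

for side in (3, 10, 40):
    d = dsc_after_one_voxel_shift(side)
    print(side, round(d, 3), round(jaccard_from_dice(d), 3))
```

For a cube of side n the overlap after a one-voxel shift is (n − 1)/n, so an identical geometric error costs a third of the DSC for a 3-voxel structure but only 2.5% for a 40-voxel one.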
Our results showed that plans optimized on raw, AI-based contours were similar to those based on the reference contours, with only minor differences. Dmean values were higher for Plan_AI and, conversely, Dmax was slightly lower for Plan_AI, suggesting that some differences might be random or caused by factors other than delineation. One such factor could be the optimization process, which might deliver slightly different solutions to the same problem, even if very similar structures and identical criteria are used for planning. Therefore, we believe that the reported differences in dose metrics are not exact reflections of variations in geometry, and that statistical differences do not necessarily translate into clinical significance. Still, ΔDmean and ΔDmax showed a modest negative correlation with DSC, meaning that a lower DSC might lead to a larger dose difference, but this was not confirmed on an individual OAR basis. These results are supported by the Bland-Altman analysis, which showed minimal or no bias and narrow limits of agreement, suggesting that the two methods are nearly identical. The gamma pass rate showed good similarity between dose distributions for most plans, with some lower values, mostly in cases with missing structures. Other authors have also noted that contouring variations might impact dose distribution, but these results should be carefully interpreted in terms of clinical relevance, given the multi-factorial aspects of dose calculation and distribution [14][15][16][40].
Considering all aspects, deep-learning, MRI-based auto-segmentation for brain OARs can be a valuable tool in clinical practice, with good accuracy especially for large structures, but further developments are required for smaller organs. This method could significantly reduce work burden, shorten treatment preparation time, and reduce inter- and intra-observer variability, with a clinically acceptable impact on the dose distribution.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.