Evaluation of an Atlas-Based Auto-Segmentation Tool of Target Volumes and Organs at Risk in Head and Neck Radiation Therapy

Introduction: Radiation therapy plays a crucial role in the multidisciplinary management required for Head and Neck Cancer (HNC). The use of highly sophisticated radiation therapy techniques for HNC mandates highly precise delineation of target volumes and Organs at Risk (OARs), which can be challenging. Computerised contouring algorithms have shown potential to improve delineation precision and inter-observer consistency and to reduce workload. This study aims to evaluate a commercially available Atlas-Based-Auto-Segmentation (ABAS) module, assessing its accuracy and clinical applicability, and to explore the potential role of ABAS in the evolving era of artificial intelligence and deep learning contouring (DLC).
Methods: The ABAS model was created using 100 HNC patients' imaging data. A second cohort of 20 patients' imaging data was used to evaluate the ABAS delineation results compared to expert manual contours. For each of the 39 regions of interest (ROIs) of every patient, three commonly used quantitative metrics were obtained and compared: the Dice Similarity Coefficient Index (DICE), the Hausdorff distance 95th-percentile (HD) and the Volume Ratio (VR). A subjective evaluation (Turing test) and a time evaluation were also performed.
Results: Regarding the quantitative metrics, the performance of the tested ABAS model was similar to that of other ABAS solutions and, for the most part, slightly inferior to that of published DLC solutions. The subjective evaluation showed that although not all ABAS contours are precise, it is not always straightforward to identify contours as being human- or computer-created. The time evaluation suggested that ABAS, followed by manual editing, can significantly reduce the time and effort needed for the segmentation of the ROIs of the head and neck region.
Conclusion: Auto-contouring will play an important role in the future of radiotherapy planning. Using ABAS or DLC modules and fine-tuning the results is efficient for clinical use and can save clinicians considerable time in delineating ROIs for the head and neck region.


Introduction
Head and Neck Cancer (HNC) represents a global public health issue with a significant socioeconomic impact, as it accounts for more than 650,000 new cases and 300,000 deaths annually, not including thyroid cancer [1]. It is the sixth most common cancer in the world and affects significantly more males than females, with a ratio ranging from 2:1 to 4:1 [1][2][3]. Tobacco use (both smoked and smokeless), areca nut, betel quid, alcohol, and human papillomavirus (HPV) infection are considered to be the most well-studied risk factors for all types of HNC, but more specifically for oral and oropharyngeal cancers [4][5][6].
The optimal management for patients with HNC requires a multidisciplinary approach, as precision patient-specific medicine needs to be incorporated. The multidisciplinary team includes surgeons (oral and maxillofacial, otolaryngology, plastic and reconstructive), radiation oncologists, medical oncologists, radiologists, pathologists and nuclear medicine consultants, among others. Surgery and/or radiotherapy, with or without neo-adjuvant, simultaneous or adjuvant systemic therapy, are the key treatment modalities. The goal of radiotherapy is to eradicate cancer cells, mostly by damaging their DNA, while minimizing radiation damage to critical healthy tissue. Radiotherapy, utilising photons or protons, for patients with HNC is complex and has greatly evolved in the past decades, owing to the advent of intensity-modulated radiotherapy techniques and adaptive treatment planning.
Intensity-Modulated Radiation Therapy (IMRT) is an advanced form of high-precision, three-dimensional conformal radiotherapy, using computer-optimized inverse treatment planning and a computer-controlled multileaf collimator. With the use of static or rotational IMRT, the intensity of radiation beams can be modulated, so that a higher radiation dose can be delivered to the target volumes, with a sharply conformal coverage, while at the same time the dose to surrounding normal tissues is significantly reduced [7]. As IMRT is a highly precise technique, it requires, apart from reproducible patient setup/immobilisation and optimal imaging modalities, highly precise delineation of target volumes and organs at risk (OARs).
Delineation guidelines exist that help to reduce contouring variations, simplify the conduct of multi-institutional clinical trials and improve the care of HNC patients [8]. Nevertheless, considerable variation in practice is observed among clinicians and institutions [9]. Additionally, the average time required for adequate manual delineation of target volumes and OARs for HNC patients is approximately 120 minutes. Moreover, the already challenging task of manual contouring is becoming more complex and time-demanding, as a consequence of the increasing number of OARs found to be associated with radiation-induced side effects. Against this background, auto-contouring of target volumes and OARs has the potential to reduce time and effort, and to improve delineation precision and inter-observer consistency [10][11][12].
Atlas-Based-Auto-Segmentation (ABAS) is a widely used method in which a set of a specific number of representative patients with carefully delineated OARs serves as a reference set for contouring new patients [13]. Contours of the reference patients are registered to new patients, in order to be transformed and implemented in the new scan. Although ABAS has already significantly reduced workload and improved consistency in clinical practice, there are a number of limitations, such as suboptimal performance for small target volumes and OARs or for patients with differing anatomies from those used as reference, as well as errors due to deformation inaccuracies [14][15][16].
Segmentation methods using deep learning techniques have emerged as a promising way of coping with these challenges. Deep Learning Contouring (DLC) typically trains a Convolutional Neural Network (CNN) model directly from high-quality data [17]. Improved computing power and adequate training of neural networks have made deep learning methods available for autosegmentation purposes. Several studies have already shown the potential of CNNs for HNC contouring and for other sites [18][19][20][21][22]. The key feature of CNNs is their ability to recognise patterns, group them and use them to predict results without the need for human intervention and, more importantly, to evolve on the basis of experience. The performance of a CNN naturally depends on the quality of the data used to train it. As noted, high-quality (manually produced) data are challenging and time-demanding to obtain. ABAS-based, manually fine-tuned data might contribute to this task [17].
This study aims to evaluate a commercial ABAS module included in a dedicated radiotherapy contouring software (ProSoma, Medcom GmbH, Darmstadt, Germany) for HNC patients. The primary goal is to assess the accuracy and clinical applicability of the generated target and OAR contours, using quantitative geometric metrics and subjective evaluations, and to compare the time needed for manual and ABAS segmentation. The secondary goal of the study is to measure the time required for the manual corrections that make the contours clinically acceptable, so that the software could be a useful tool in creating quality data for CNNs.

ProSoma atlas data set
A total of 100 patients' DICOM files were used to build the atlas data set. Eligible patients for this data set were patients with non-specific primary tumour or treatment site, who had not undergone head-neck surgery for any reason, positioned and simulated in supine position, in order to be treated with curative or palliative radiation therapy. All the patients were immobilised with a custom five-point commercial thermoplastic mask (IMRT-Mask-Precut-MR-09®, Unger Medizintechnik GmbH, Mülheim-Kärlich, Germany), using standard head-neck cushions (4-pieceset-of-Head-Support-Cushions®, Unger Medizintechnik GmbH, Mülheim-Kärlich, Germany). [...] parotids, submandibular glands, mandible (temporomandibular) joints, sternocleidomastoids, as well as the brain, brainstem, optic chiasma, hypophysis, spinal cord, larynx, hyoid, mandible, trachea and body surface. All the manual segmentations were delineated by a dedicated team of radiation oncologists with expertise in HNC, according to previously published international consensus delineation guidelines [23,24].

Reference and test data set
A total of 20 patients' DICOM files were used to build the reference and test data set. Eligible patients for this data set were HNC patients who had not undergone head-neck surgery for any reason, positioned and simulated in supine position, in order to be treated with definitive radiochemotherapy. All the patients, as for the ProSoma atlas data set, were immobilised with a custom five-point commercial thermoplastic mask (IMRT-Mask-Precut-MR-09®, Unger Medizintechnik GmbH, Mülheim-Kärlich, Germany), using standard head-neck cushions (4-pieceset-of-Head-Support-Cushions®, Unger Medizintechnik GmbH, Mülheim-Kärlich, Germany).
For each patient, a planning-CT scan (2.5 mm), using a multi-slice CT GE Discovery CT590 RT (GE Healthcare, Chicago, USA), was performed, followed by the manual segmentation of the same target volumes and OARs as for the ProSoma atlas set. All the manual segmentations were delineated by a dedicated team of radiation oncologists with expertise in HNC, according to previously published international consensus delineation guidelines [23,24].
Automatic segmentation was subsequently performed for all patients using the auto contouring module of the software. For the automatic segmentation, the same target volumes and OARs as for the ProSoma atlas and the ABAS reference data sets were delineated.

Quantitative evaluation
The autocontouring module's performance was evaluated by comparing the automatically generated contours (ABAS test data set) with the manual contours (ABAS reference data set) using the following metrics:
• The Dice similarity coefficient (DICE), which quantifies the overlap between contours A and B [25]: $\mathrm{DICE}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}$
• The Hausdorff distance 95th-percentile (HD), i.e. the 95th percentile of the pairwise 3D point distances between the contours of two structures, in mm [26].
• The volume ratio (VR) between the contours of the two groups.
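The three metrics above can be illustrated with a minimal sketch. It is not the implementation used by the contouring software. It assumes that each ROI is available as a boolean voxel mask (for DICE and VR) and as an array of 3D surface points in mm (for HD). The function names, the direction of the VR (automatic volume divided by manual volume) and the symmetric form of the 95th-percentile Hausdorff distance are assumptions made for illustration only; other variants of the HD exist.

```python
import numpy as np


def dice(mask_a, mask_b):
    """Dice similarity coefficient 2|A ∩ B| / (|A| + |B|) of two boolean masks."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        # Assumption: two empty masks are treated as perfect agreement.
        return 1.0
    return float(2.0 * np.logical_and(a, b).sum() / total)


def volume_ratio(auto_mask, manual_mask):
    """Volume ratio, assumed here to be automatic volume / manual volume."""
    auto = np.asarray(auto_mask, dtype=bool)
    manual = np.asarray(manual_mask, dtype=bool)
    if manual.sum() == 0:
        raise ValueError("manual mask is empty, volume ratio is undefined")
    # Voxel sizes cancel out when both masks share the same grid.
    return float(auto.sum() / manual.sum())


def hd95(points_a, points_b):
    """Symmetric 95th-percentile Hausdorff distance between two point sets (mm)."""
    pa = np.asarray(points_a, dtype=float)
    pb = np.asarray(points_b, dtype=float)
    # Pairwise Euclidean distances, shape (len(pa), len(pb)).
    dist = np.linalg.norm(pa[:, None, :] - pb[None, :, :], axis=2)
    a_to_b = dist.min(axis=1)  # nearest point of B for every point of A
    b_to_a = dist.min(axis=0)  # nearest point of A for every point of B
    return float(max(np.percentile(a_to_b, 95), np.percentile(b_to_a, 95)))


if __name__ == "__main__":
    # Two toy 4x4x4 masks that overlap in half of their voxels.
    auto = np.zeros((4, 4, 4), dtype=bool)
    manual = np.zeros((4, 4, 4), dtype=bool)
    auto[0:2] = True
    manual[1:3] = True
    print(dice(auto, manual), volume_ratio(auto, manual))
```

The brute-force distance matrix in `hd95` is adequate for small structures only; for large surfaces a spatial index or a distance transform would be the more practical choice.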

Subjective evaluation
A subjective evaluation of the contouring methods was carried out with a Turing test, which assumes clinical usability of auto-generated contours if they are difficult to distinguish from manual contours [27]. To prevent potential bias, an HNC radiation oncology expert (observer) who was not involved in the project was invited to take the Turing test. Following the approach described by Gooding, et al. [27], the observer was blindly presented with random slices that had an equal probability of featuring manual or auto-generated contours for all target volumes and OARs. The observer assessed the following questions for 100 scenarios:
• A single contour: "How was this contour drawn?" Answer options: "By a human" or "By a computer".
• Two contours: "Which contour do you prefer?" The preferred contour is selected by the observer.
• "Would you correct the contour?" Answer options: [...] "Accept it as it is; the contour is very precise".

Time evaluation
The time needed for the manual and the automatic segmentation of every ROI for each patient was recorded, as was the manual correction time needed for all ROIs to become clinically acceptable.

Quantitative evaluation
The autocontouring module generated contours for every Region of Interest (ROI) in all 20 patients. The results for every ROI, target or OAR, are presented in groups according to ROI volume. Group A consists of ROIs smaller than 1 ml, Group B of ROIs between 1 ml and 3 ml, Group C of ROIs between 3 ml and 20 ml, Group D of ROIs between 20 ml and 40 ml and Group E of ROIs larger than 40 ml (Table 1).
[...] Regarding VR values for Group B, the autocontouring module had a good performance for bilateral mandible (temporomandibular) joints and hyoid, while for bilateral inner ears and Level IA (Figure 6) [...]
[...] Regarding VR values for Group D, the autocontouring module had a good performance for all ROIs apart from right parotid and trachea (good intermediate performance) (Figure 12).
For Group E (>40 ml), the autocontouring module had a good performance regarding DICE for all ROIs (Figure 13). The DICE value for brain was the highest (0.97±0.01), while the DICE value for mandible was the lowest in Group E (0.79±0.09). Regarding HD values for Group E, the autocontouring module had a good performance for brain, a good intermediate performance [...] Regarding VR values for Group E, the autocontouring module had a good performance for all ROIs (Figure 15). Brain had the best VR value (1.00±0.03), while mandible had the worst VR value (1.12±0.17) in Group E.
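The volume-based grouping used to present the results can be written down as a small helper. This is only a sketch: the text does not state to which group a ROI of exactly 1, 3, 20 or 40 ml belongs, so the boundary handling below (lower bound inclusive, 40 ml assigned to Group D because Group E is "larger than 40 ml") is an assumption, and the function name is hypothetical.

```python
def volume_group(volume_ml):
    """Assign a ROI to volume group A-E from its volume in millilitres.

    Boundary handling is an assumption: the source only gives the ranges
    <1, 1-3, 3-20, 20-40 and >40 ml.
    """
    if volume_ml < 0:
        raise ValueError("volume must be non-negative")
    if volume_ml < 1:
        return "A"
    if volume_ml < 3:
        return "B"
    if volume_ml < 20:
        return "C"
    if volume_ml <= 40:
        return "D"
    return "E"


if __name__ == "__main__":
    # Illustrative volumes only, not measured values from the study.
    for volume in (0.5, 2.0, 10.0, 30.0, 1200.0):
        print(volume, volume_group(volume))
```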

Subjective evaluation
For the question of whether contours were human- or computer-drawn, 32% of the human-drawn contours were misclassified as computer-created. The misclassification rate for the module's contours was 21% (Figure 16). For the question of which of two contours was preferred, human-drawn contours were substantially more often preferred (81%) than computer-created contours (19%). For almost all ROIs apart from brain, human-drawn contours were more often preferred (Figure 16). The responses to the question "Would you correct the contour?" suggest low rates of large, obvious errors (5%) in human-drawn contours. In contrast, 29% of computer-created contours were considered to have obvious errors. Minor errors that required correction were less often found in human-drawn contours (9%) than in computer-created contours (12%). 86% of the human-drawn contours and 59% of the computer-created contours would be accepted as they were, with minor or without any corrections (Figure 17).

Time evaluation
The average time needed for the manual segmentation of each patient was 102.3 minutes, while the time needed for the autocontouring was 2.4 minutes. The average time needed for the manual correction of all ROIs for each patient in order to be clinically acceptable was 52.7 minutes.
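The reported averages allow a rough estimate of the time saved per patient. The short calculation below is a sketch under one assumption that the text does not confirm: that the 52.7 minutes of manual correction do not already include the 2.4 minutes of automatic contouring, so that the two are added to obtain the total time of the ABAS-assisted workflow. The function name is hypothetical.

```python
def time_saving(manual_min, auto_min, correction_min):
    """Return (assisted total, minutes saved, percent saved) per patient.

    Assumption: the assisted workflow takes auto_min + correction_min.
    """
    assisted = auto_min + correction_min
    saved = manual_min - assisted
    percent = 100.0 * saved / manual_min
    return assisted, saved, percent


if __name__ == "__main__":
    # Average times reported in this section, in minutes.
    assisted, saved, percent = time_saving(102.3, 2.4, 52.7)
    print(f"assisted: {assisted:.1f} min, saved: {saved:.1f} min ({percent:.1f}%)")
```

Under this assumption the ABAS-assisted workflow takes about 55.1 minutes per patient, which is roughly 47.2 minutes (about 46%) less than fully manual segmentation.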

Discussion
Manual delineation of targets and OARs is not only a challenging and time-consuming task, but can also lead to uncertainties due to inter- and intra-observer variations in modern radiotherapy, which relies on precise techniques. Additionally, this challenging task is constantly becoming more complex, as an increasing number of OARs is shown to be associated with radiation-induced side effects, especially in the head and neck region. Against this background, auto-contouring of target volumes and OARs has the potential to reduce time and effort, and to improve delineation precision and inter-observer consistency [10][11][12].
Due to recent technological developments, deep learning techniques have emerged as a promising method of autodelineation. Training a Convolutional Neural Network (CNN) model directly from high-quality data has improved the autosegmentation results [18][19][20][21][22]. It seems that high quality data is the key for an efficient CNN. Manually edited ABAS-atlases, that can minimise the discrepancies among clinicians, have the potential to simplify the collection of such data, while respecting the personal contouring preferences of the clinicians.
In our study we evaluated a commercial ABAS module included in a dedicated radiotherapy contouring software (ProSoma, Medcom GmbH, Darmstadt, Germany), which is based on an atlas library consisting of 100 patients. We assessed the performance of the module using 20 patients' DICOM files, delineating 39 ROIs, and comparing the manual contours to the module-created contours. Quantitative and subjective evaluations were performed. It has to be noted that the metrics we used are commonly used for comparing contours, but their interpretation is not always straightforward. As new technologies are tested, we strongly believe that standardised, reliable sets of both quantitative and subjective metrics will arise, allowing more accurate and reproducible comparison of contours. To the authors' knowledge, our study tests the largest number of distinct head and neck ROIs (39) reported in the literature.
In the following section we compare the results of our study with those of previously published studies and discuss the reasons for the discrepancies, if any.
Lenses: Fortunati, et al. proposed an autosegmentation method primarily for hyperthermia treatment of head and neck tumours that combines anatomical information based on atlas registration with local intensity information in a graph-cut framework [28]. This novel method was compared with a multi-atlas ABAS method for multiple organs including lenses. The ABAS method they used [...]
Lacrimal Glands: As no studies were found in the literature that used ABAS methods for autosegmentation of the bilateral lacrimal glands, a recent study by Nikolov, et al. that used deep learning techniques is used for comparison [29]. The authors reported DICE values of 0.62±0.13 and 0.69±0.12 for the left lacrimal gland, as well as DICE values of 0.61±0.11 and 0.70±0.12 for the right lacrimal gland. Our results (average DICE: 0.27±0.07 and 0.29±0.10 for the left and right lacrimal gland, respectively) are significantly worse. This can be explained by the fact that different methods were used. It is widely accepted that deep learning techniques provide better autosegmentation results [21,22].

Hypophysis:
As no studies were found in the literature that used ABAS methods for autosegmentation of the hypophysis, a study by Tao, et al. that used an ABAS method followed by manual correction of the ROIs and compared the results to expert contours, to assess interobserver variability, is used for comparison [29]. Tao, et al. [...]
Optic Nerves: Walker, et al. used a commercial ABAS method for multiple head and neck ROIs, including optic nerves, and compared it to ABAS followed by manual editing and to manual segmentation [30]. They showed DICE values of 0.71±0.26 for the ABAS method used. These DICE values are superior to our results (Optic Nerve L: 0.56±0.11, Optic Nerve R: 0.56±0.12). Possible explanations could be the different delineation extent of optic nerves and optic chiasma or the better performance of the ABAS method used for this specific ROI.

Mandible (temporomandibular) Joints:
No studies were found that evaluated any autocontouring method for the above ROIs, although some authors consider them significant OARs [35]. Our quantitative indexes have been presented previously.
Level IA-V: As no studies were found that used ABAS methods for autosegmentation of the individual head and neck lymphatic levels, two studies that used ABAS for all the lymphatic levels as a single ROI are used for comparison. Yang et al. have shown DICE and HD indexes of 0.77±0.03 and 10±2.1 mm for a single-atlas approach, and of 0.78±0.03 and 9.9±1.7 mm for a multi-atlas approach [36]. Voet, et al. reported similar DICE values of 0.81 (0.69-0.95) and 0.82 (0.64-0.89) [37]. Although our indexes (average DICE for Levels IA-V: 0.57) appear to be significantly inferior, the results are not comparable, as both studies evaluated the whole lymphatics as a single ROI.
Hyoid: No studies were found that evaluated any autocontouring method for the above ROI. Our quantitative indexes have been presented previously.  [30].

Sternocleidomastoids:
No studies were found that evaluated any autocontouring method for the above ROI. Our quantitative indexes have been presented previously.

Trachea:
No studies were found that evaluated any autocontouring method for the above ROI. Our quantitative indexes have been presented previously.

Body surface:
No studies were found that evaluated any autocontouring method for the above ROI. Our quantitative indexes have been presented previously.
It has to be noted that assessing the clinical usability of contours based on geometric measures alone is very challenging. Moreover, comparisons between autocontouring methods based on metrics such as DICE, HD or VR are of unknown value.

Subjective evaluation
The Turing test we performed showed that although not all computer-drawn contours are precise, it is not always straightforward to identify contours as being human- or computer-created. The fact that 32% of the human-drawn contours were misclassified as computer-created and 21% of the computer-drawn contours were misclassified as human-created indicates the degree of intra- and interobserver variability. Although manual contours seem to be more often preferred, almost 60% of the computer-drawn contours would be accepted with minor or no editing. A significant limitation of the Turing test we performed is the presentation of single slices for the evaluation of ROIs. If three-dimensional information had been provided to the observer, the results could have been different.

Time Evaluation
The time evaluation suggests that ABAS, followed by manual editing, can significantly reduce the time and effort needed for the segmentation of the ROIs of the head and neck region. ABAS methods have been clinically implemented for adaptive radiation therapy for a number of years in many radiotherapy clinics and are being increasingly used in the daily routine. For institutions that wish to build neural networks, manually edited ABAS methods might be the procedure of choice.

Conclusion
Autocontouring will clearly play an important role in the future of radiotherapy planning. The commercial ABAS module that we evaluated, included in a dedicated radiotherapy contouring software (ProSoma, Medcom GmbH, Darmstadt, Germany), has a similar performance to other commercial atlas-based solutions in terms of commonly used quantitative and subjective metrics. The time evaluation showed a significant benefit for the daily routine. We demonstrated that using this module and fine-tuning the results is efficient for clinical use and can save clinicians considerable time in delineating ROIs for the head and neck region.