Clinical validation of commercial deep-learning based auto-segmentation models for organs at risk in the head and neck region: a single institution study

Purpose To evaluate organ at risk (OAR) auto-segmentation in the head and neck region of computed tomography images using two different commercially available deep-learning-based auto-segmentation (DLAS) tools in a single institutional clinical applications. Methods Twenty-two OARs were manually contoured by clinicians according to published guidelines on planning computed tomography (pCT) images for 40 clinical head and neck cancer (HNC) cases. Automatic contours were generated for each patient using two deep-learning-based auto-segmentation models—Manteia AccuContour and MIM ProtégéAI. The accuracy and integrity of autocontours (ACs) were then compared to expert contours (ECs) using the Sørensen-Dice similarity coefficient (DSC) and Mean Distance (MD) metrics. Results ACs were generated for 22 OARs using AccuContour and 17 OARs using ProtégéAI with average contour generation time of 1 min/patient and 5 min/patient respectively. EC and AC agreement was highest for the mandible (DSC 0.90 ± 0.16) and (DSC 0.91 ± 0.03), and lowest for the chiasm (DSC 0.28 ± 0.14) and (DSC 0.30 ± 0.14) for AccuContour and ProtégéAI respectively. Using AccuContour, the average MD was<1mm for 10 of the 22 OARs contoured, 1-2mm for 6 OARs, and 2-3mm for 6 OARs. For ProtégéAI, the average mean distance was<1mm for 8 out of 17 OARs, 1-2mm for 6 OARs, and 2-3mm for 3 OARs. Conclusions Both DLAS programs were proven to be valuable tools to significantly reduce the time required to generate large amounts of OAR contours in the head and neck region, even though manual editing of ACs is likely needed prior to implementation into treatment planning. The DSCs and MDs achieved were similar to those reported in other studies that evaluated various other DLAS solutions. Still, small volume structures with nonideal contrast in CT images, such as nerves, are very challenging and will require additional solutions to achieve sufficient results.


Introduction
The evolution of radiation therapy techniques in recent decades has led to major improvements in dose conformality along with precision in dose delivery.Modern methods of dose delivery such as intensity-modulated radiation therapy, volumetric arc therapy, and recently, intensity-modulated proton therapy all have proven to improve local control as well as normal tissue sparing in various tumor types (1)(2)(3)(4).
However, to capitalize on the benefits of these treatment modalities, target volumes and surrounding organs-at-risk (OARs) must be carefully delineated on computed tomography (CT) images.This is heavily demonstrated in the case of head and neck cancer (HNC) that often lie in complex anatomical locations surrounded by numerous OARs.The delineation of neighboring structures is a time-consuming manual process, that mandates experienced knowledge of the local anatomy.Furthermore, manual delineation introduces inter-observer variability as evidenced in several recent studies (5)(6)(7).OAR delineation guidelines have been published by many authors to combat this but vary widely causing difficulty when comparing dose-volume relationships across studies (8).Consensus guidelines were established in 2015 integrating advice and expertise from radiation oncologists from across the world (9).However, even with established evidence-based guidelines, inter-observer variability still exists as shown in a study conducted by van der Veen et al. (10)in which only around half of the participating radiation oncologists utilized the standardized guidelines.
Methods of utilizing advancements in automatic segmentation techniques have emerged to potentially combat lengthy processing times and wide inter-observer variability.Atlas-based autosegmentation (ABAS) is one such method in which an 'atlas' of OARs is established by training a software program with a dataset that has OARs already labeled.An explanation of this process has been published by Han et al. (11).This technique has been proven to reduce processing time as well as generating appropriate sedimentation for various OARs (12,13).Another automated segmentation method currently being investigated is deeplearning-based auto-segmentation (DLAS).DLAS utilizes machine learning to incorporate vast datasets and generate an automated solution.This technique has shown promise in recent studies assessing the efficiency in the head and neck region (14,15).
There are several commercially available ABAS software programs as well as in-house developed DLAS programs that have been validated on HNC.A comparison study conducted by La Macchia et al. ( 16) compared three auto-segmentation programs and reported significant reductions in time to generate quality contours when compared to manual processes.However, many studies have shown the contours generated by either ABAS or DLAS methods still require additional manual editing to be clinically acceptable (16)(17)(18).In a study conducted by van Dijk et al. ( 19), time to generate clinically acceptable contours still proved to be significantly less when created using either ABAS or DLAS.Even still, the authors noted that the DLAS method used outperformed ABAS when evaluating on a cohort of HNC patients.However, large cohorts of training data sets are required to train and get an accurate DLAS model.It is not feasible for each center to develop and train their own DLAS model, thus commercially available models or shared DLAS packages would be advantageous.Evidence from clinical validation of DLAS packages will allow for centers to identify solutions that will provide optimal performance for their particular needs.Thus, this study aims to clinically validate the generic models for HNC OAR autocontouring of two commercially available deep-learning-based auto-segmentation software packages, AccuContour (version 3.1, Manteia Medical Technologies, Wisconsin, MI) and ProtègèAI (version 1.0, MIM Software Inc., Cleveland, OH), Both MIM and Manteia's solutions include a generic HNC autocontouring model.Quantitative evaluation on a set of 40 clinical HNC patients will be performed.

Expert contour creation
To validate each DLAS model, a cohort of 40 HNC patients were selected who previously received treatment at our institution.All patient data in this retrospective study was approved under an internal review board.All planning CT (pCT) images consisted of 512 pixels × 512 pixels in each slice with voxel size of 0.98mm×0.98mm×1.50mm.All the CT data were acquired on the same version of CT scanner (Somatom Definition AS, Siemens, Forchheim, Germany) without contrast enhancement.In total, 22 OARs were manually contoured to establish the expert contours (ECs) on the pCT images by expert radiation oncologists at our institution according to international consensus guidelines.These OARs were divided into five groups: 1) Glandular: submandibular and parotid glands; 2) Aerodigestive Tract: oral cavity, larynx, esophagus, constrictor muscle; 3) Ocular and Aural: cochlea, lens, eye; 4) Neural: brainstem, chiasm, spinal cord, optic nerves, temporal lobes; 5) Other: mandible.

DLAS contour creation
In contrast to atlas model-based auto-segmentation which utilizes a trained model of shape and appearance characteristics of anatomy structures and then project onto a new image set through deformable imaging registration, DLAS uses deep neural network architectures with multiple (2 or more) hidden layers to learn features from a dataset by modeling complex nonlinear relationships.These architectures are usually formed by stacking several different-type layers that transform input images to the desired output.The transformation through convolution filters, or kernels, reveals local connectivity between neurons of adjacent layers exploiting spatially local correlation.This permits the networks to learn features both globally and locally allowing the network to detect subtle variations in the input data, which here mean the features of different OARs.The training processes generally utilize supervised learning by back-propagation algorithms, which optimize the node weights to minimize the loss between the predicted and known output through each training iteration to a satisfactory level of accuracy.It remains to be unanswered how many patient scans are optimal to produce clinically acceptable results.One would agree that a robust dataset that includes a large variability of patient anatomies would achieve reasonable and robust model.During past years, more and more commercial DLAS software have emerged and become clinically available.The following two different DLAS packages were implemented in our clinic and assessed in this study: Manteia AccuContour is a commercial deep-learning-based auto-segmentation software using deep convolutional neural network models based on a U-Net architecture, the design of which follows the work of Ronneberg et al. (20).The training data included in the model consists of 100 HNC image sets acquired from GE, Philips, and Siemens CT scanners.The HNC model was then applied to the same 40 pCTs used in the expert contour creation.As with the expert contour creation, a unique set of autocontours (ACs) was generated for each patient case.
The MIM ProteǵeÁI generic HNC model is a cloud-based deep learning segmentation model with a similar structure to U-Net.The training data included in the model consists of about 400 HNC images gathered from 31 institutions mainly across the US, but with a few additional institutions located in Europe, Hong Kong, and Australia.MIM's HNC model was also applied to the 40 pCTs used in the expert contour creation.No post-processing was completed after the 3D volume generation for each contour.Again, a unique set of ACs was generated for each patient case.The MIM ProteǵeÁI autocontour model did not provide contours for temporal lobes, cochlea, or the constrictor muscle.Thus, only 17 OAR contours were generated for each patient.

Evaluation metrics
For performance evaluation, the Sørensen-Dice similarity coefficient (DSC) (21), and the mean distance (MD) between the ECs and ACs for the OARs of each patient were calculated as a comparison metric.The DSC is defined as: Which describes the overlapping volume between two structures A and B. A value of 0 indicates no overlap; a value of 1 indicates complete overlap.The MD is a bi-directional measure of the distance between the surface of two contours and is defined as: Where N(A) is the total nodes on the surface of A structure, min d(a, b ∈ S B ) is the minimum Euclidean distance of node a i to any point b on surface of B. The smaller mean distance indicates the surfaces of A and B are closer to each other.For every clinical case, each OAR delineated on the ECs and ACs was compared, generating a DSC and MD value.Results are presented as averages across the patient cohort for each OAR with ranges describing the performance of grouped OARs and mean ± standard deviation describing the performance of specific OARs.

Discussion
Both ABAS and DLAS methods have shown promise in reducing variability and time required to establish contours (22)(23)(24)(25).This study aimed to clinically validate two DLAS commercial software programs by comparing automatically-generated OAR contours with those created manually for a cohort of 40 HNC patients.DLAS contours were evaluated with two gold-standard geometric measures.Reasonable agreement was shown for the glandular OARs, eyes, brainstem, spinal cord, oral cavity, esophagus, and mandible across both autocontour programs.Moderate agreement was shown for the optic nerves and lenses, constrictor muscle, and larynx (ProteǵeÁI specifically).There was poor agreement for the larynx (AccuContour specifically) and chiasm.The results demonstrate that each DLAS package can adequately contour most HNC OARs efficiently in an independent cohort of patients.
We were able to provide results that closely resemble those reported in other DLAS studies in the head and neck region.There is a consensus on the efficiency of DLAS programs to contour the mandible and brainstem as reported by Brunenberg et al. (26) (DSC 0.95 and 0.87, respectively) validating another commerciallyavailable DLAS program, DLCExpert ™ .The results for the glandular OARs and optic nerves in this study resemble those reported by Ibragimov and Xing (15) (DSC parotid gland 0.78, submandibular gland 0.73, optic nerve 0.65) using a convolutional neural network (CNN) approach.Willems et al. ( 27) reported pharyngeal constrictor muscle results (inferior, middle, and superior constrictor muscle average DSC 0.55) that closely resemble the overall constrictor muscle ACs generated in this study.Overall, structures with larger volumes appeared to be easier for the DLAS models to contour as given by the larger DSCs.
Figure 2 shows that all the ACs generated by MIM ProteǵeÁI and Manteia AccuContour were comparable except for the larynx.The average DSC for the larynx contoured by AccuContour was noticeably lower than that of the larynx contoured using ProteǵeÁI.When investigated further, it was noted that the larynx AC within AccuContour was consistently omitting the airspace within the larynx when generating a contour (Figure 3).Our institution's standard of practice for contouring the larynx is to include the entire structure as well as the airspace within.This apparent discrepancy led to less agreement between the EC and AC for the larynx using AccuContour, while ProteǵeÁI was able to contour in a similar fashion to our experts.Fortunately, Mantiea's AccuLearning software allows for the creation of in-house models.Should we want to establish a DLAS model using our institution's method of contouring the larynx, we would be able to do so in the future.The OAR contours from each DLAS model which suffered the most were the chiasm and optic nerves (Figure 4).The low average DSCs of these structures do not appear unique to our study, as several other papers have reported low values for the chiasm and optic nerves (28)(29)(30).This can likely be attributed to small volumes of these structures as well as low contrast to the surrounding brain tissue in CT, making it difficult to accurately segment the structures as noted by a study from Ren et al (31).In this study, investigators were able to improve small structure DSCs (including the chiasm and optic nerves) using a specialized 3D CNN approach.In a 2018 study aiming to improve segmentation for small volume structures in the head and neck region, Tong et al. (28) were able to rescue the low DSC of the chiasm and optic nerves to an extent by employing a Shape Representation Model (SRM) to a Fully Convolutional Neural Network (FCNN), improving the average DSC for these small structures.Thus, further effort to adjust autocontouring  solutions to accurately contour structures with small volumens appears advantageous.
Our study does have potential limitations.First, the generic HNC model within MIM ProteǵeÁI does not currently contain the temporal lobes, constrictor muscle, or cochlea.However, MIM's research & development team has confirmed that these structures will be included in future updates and can be clinically validated at that time.We also only validated each model on 40 pCTs from one institution, thus limiting the power and generalizability of our results.Lastly, additional manual editing of ACs after creation was not performed, and only the first iteration of ACs generated was compared.
Thus, one consideration for future use is the practice of assessing ACs with additional manual editing where needed.This practice has proved useful in improving contour acceptability in previous studies.Willems et al. ( 27) recorded the time required to correct ACs as well as resulting DSCs and found that only an additional 15 minutes total were required to improve the ACs.This was also confirmed in a study by Teguh et al. (22), where the ACs were edited, improving similarity metrics while still requiring less time than manually creating contours from scratch.We did not routinely edit the contours generated by ProteǵeÁI or AccuContour, but future work could include this practice to assess time sparing and contour improvements.
While the auto-segmentation similarity metrics investigated in this study are commonly used, future research should incorporate advanced comparison metrics such as the surface DSC, as described by Vaassen et al. (32).A comprehensive overview of additional metrics is provided by Kiser et al. (33), who corroborate previous findings that surface DSC may better correlate with clinical applicability.Additionally, this study focused on the performance of commercially available auto-segmentation solutions; however, there is a growing trend of research teams developing open-source solutions, available on platforms like GitHub (34,35).These opensource solutions often yield competitive results, rivalling commercial software and may be of particularly beneficial for research studies with limited funding or in clinics with constrained resources.Beyond the two commercial artificial intelligence based auto-segmentation software evaluated in this study, other options such Mirada, MVsision, Radformation, Raystation and TheraPanacea also provide contours of comparable quality for the organs at risk (OARs) in the head and neck region (36).The structures covered by each system are adequate for clinical application and can be customized for specific anatomical sites.The reported accuracy metrics of each system should be considered as one of the critical factors in the decision-making process.Additionally, institutions should evaluate the cost, service quality, and integration capability with existing clinical workflows when selecting an auto-segmentation solution.

Conclusion
Both commercially available DLAS programs were able to significantly reduce the time required to generate OAR contours, even though manual editing of ACs is likely needed prior to implementation into the clinic.The DSCs and MDs achieved were similar to those reported in other studies that evaluated various other DLAS solutions.Still, structures with small volumes are difficult to generate accurate ACs for and will require additional solutions to achieve sufficient contours.
FIGURE 2 Comparison of Manteia AccuContour and MIM ProtègèAI for various OARs Sørensen-Dice similarity coefficients (A), and Mean Distances (B).

FIGURE 3
FIGURE 3Comparison of AC and EC Larynx contour.
Table 1 lists the average DSC and MD values between ECs and Manteia AccuContour ACs for each OAR.
Table 2 lists the average DSC and MD values between ECs and MIM ProtègèAI ACs for each OAR.

TABLE 1
DSCs and MDs for multiple-subject Manteia AccuContour ACs vs. ECs.

TABLE 2
DSCs and MDs for multiple-subject MIM ProtégéAI ACs vs. ECs.