Creation of knowledge‐based planning models intended for large scale distribution: Minimizing the effect of outlier plans

Abstract Knowledge‐based planning (KBP) can be used to estimate dose–volume histograms (DVHs) of organs at risk (OAR) using models. The task of model creation, however, can result in estimates with differing accuracy; particularly when outlier plans are not properly addressed. This work used RapidPlan™ to create models for the prostate and head and neck intended for large‐scale distribution. Potential outlier plans were identified by means of regression analysis scatter plots, Cook's distance, coefficient of determination, and the chi‐squared test. Outlier plans were identified as falling into three categories: geometric, dosimetric, and over‐fitting outliers. The models were validated by comparing DVHs estimated by the model with those from a separate and independent set of clinical plans. The estimated DVHs were also used as optimization objectives during inverse planning. The analysis tools lead us to identify as many as 7 geometric, 8 dosimetric, and 20 over‐fitting outliers in the raw models. Geometric and over‐fitting outliers were removed while the dosimetric outliers were replaced after re‐planning. Model validation was done by comparing the DVHs at 50%, 85%, and 99% of the maximum dose for each OAR (denoted as V50, V85, and V99) and agreed within −2% to 4% for the three metrics for the final prostate model. In terms of the head and neck model, the estimated DVHs agreed from −2.0% to 5.1% at V50, 0.1% to 7.1% at V85, and 0.1% to 7.6% at V99. The process used to create these models improved the accuracy for the pharyngeal constrictor DVH estimation where one plan was originally over‐estimated by more than twice. In conclusion, our results demonstrate that KBP models should be carefully created since their accuracy could be negatively affected by outlier plans. Outlier plans can be addressed by removing them from the model and re‐planning.


| INTRODUCTION
Knowledge-based planning (KBP) is an emerging field in radiation therapy which uses machine learning techniques to estimate radiation therapy dose. KBP can be generalized to be the automation of different steps in the creation of a plan based on past practice.
These steps can range from the estimation of field direction, 1 weights of optimization objectives, 2 and even dose distribution. 3,4 The majority of KBP work, however, has focused on estimating dose-volume histograms (DVHs) [5][6][7][8][9] which are commonly used to evaluate plan quality and guide the inverse planning process.
Radiation treatment planning is a complex process which can result in an infinite number of plans; some of which are suboptimal. This is because the final dose distribution is dependent on the geometry of the organs at risk (OAR) with respect to the target. Other factors that can potentially affect the quality of the final plan are differences in dose prescription, 10 treatment technique, and planner experience. 11 It is because of these reasons that plan quality evaluation has been based on user experience primarily making the development of quantitative tools necessary.
KBP tools have been in development by different groups over the past few years. Wu et al. introduced the concept of the overlap volume histogram and used it to estimate the DVHs for OARs, 6 and automate the treatment planning process in head and neck (HN) cases. 12 Zhu et al. used the distance to target histogram, support vector regression, and principal component analysis to estimate DVHs in the context of adaptive radiation therapy. 7 Yuan et al. proved that it is possible to quantify the complex relationship that different factors have on the final shape of the DVH. 9 This group also used their tool to exchange models that summarize plan creation strategies among different institutions, hence providing a means to standardize treatment planning. 13,14 Moore et al. introduced a KBP tool to perform quality assurance on intensity modulated radiation therapy (IMRT) plans and reduce dosimetric variability. 15 Appenzoler et al. described a mathematical framework to estimate differential DVHs using a summation of skew-normal distributions whose parameters were fitted based on previous plans. 8 This work has being further expanded to be used in the case of intracranial lesions. 5 The work described in the previous paragraph has been done using tools developed in house. It is only more recently that a commercial application became available (RapidPlan TM , Varian Medical Systems, Palo Alto, CA). RapidPlan allows the user to estimate DVHs of OARs using "models" which are trained using principal component analysis (PCA) and stepwise regression analysis. This training requires the user to define a number of plans as the training set which can be done in a number of ways, hence necessitating model performance evaluation. Fogliata et al. have evaluated the performance of RapidPlan using volumetric arc therapy in hepatocellular, lung, and prostate cancer. 16,17 Their results showed that RapidPlan models can be used to achieve clinically acceptable plans. Tol et al. evaluated the performance RapidPlan on head and neck and showed the ability to achieve clinically acceptable plans for this site. 18 These studies identified the need to investigate the proper identification of outlier plans, that is, plans that do not seem to follow the general trend of the training set and could have a negative effect on the models.

Delaney et al. systematically introduced dosimetric outliers in KBP
models and found little change in the accuracy of the model. 19 This was attributed to the decreased precision of the estimated DVHs, whose lower boundary was used for planning. Their study focused on the effect of dosimetric outliers and briefly mentions the negative effect of geometric outliers on KBP models. Proper outlier analysis also helps control model over learning, meaning that the trained model only reflects the training set and not similar cases.
This manuscript describes the process used to create models to be used by a wide range of users with emphasis on the steps taken to address all types of outliers. This is the first study, to the best of our knowledge, which categorizes the different types of outliers, provides strategies to address each of them as well as examples of their effect in the model. This extended outlier analysis will benefit users who are starting to build their own KBP models. Furthermore, the models created are included in the commercial distribution of RapidPlan and we believe it is important for the user to understand the philosophy used to create the models as well as the accuracy obtained during their creation. We will show that proper outlier removal can affect the accuracy of estimated DVHs. This outlier analysis becomes particularly important when creating models that have the potential to be used by a wide range of users.

2.A | Overview of the algorithm
The commercial implementation of KBP uses a DVH estimation algorithm that is different from the algorithms described above. This algorithm estimates DVHs by dividing the OAR volume into four different regions: the out-of-field, in-field, leaf-transmission, and overlap region. All regions contribute to the final DVH but each region is modeled differently depending on the desired detail and accuracy.
While both the in-field and overlap regions are dependent on fluence modulation, the shapes of DVHs in the overlap region are similar across all plans. For this reason, and under the assumption that the target is the primary priority in that area, the overlap region is modeled by the mean and standard deviation of the DVHs part corresponding to that overlap area. In the in-field region, however, the shapes of the in-field DVHs vary considerably across all the plans. It is in this region where the major improvements in tissue sparing could be achieved, under the assumption that the target is the primary priority in the overlap area. Thus, this in-field region, where higher modulation is present and higher accuracy is desired, is modeled using principal component analysis (PCA) and regression techniques. PCA is used to transform the histograms into principal component scores (PCS), thus reducing the dimensionality of the problem. PCA is also used to parametrize the geometry-based expected (GED) histogram, a 3D matrix incorporating the target geometry, target dose, and treatment field geometry. Other parameters that affect the PCS of the DVH include the OAR volume, target volume, OAR overlap volume percentage to target, and the proportion of the OAR that is out-of-field.* 2.B | Models creation 2.B.1 | Training plans selection Two models were created, one for prostate, and one for head and neck. Once the anatomical sites were identified, patients who received treatments to these sites were retrospectively selected by searching our institution database. Using previously treated plans was desirable since these plans reflect treatment techniques that are clinically acceptable at our institution.
The models were trained with plans selected to include a wide range of cases found commonly at our clinic. More specifically, prostate treatments included: c. High-risk patients. Patients whose prostate or prostate bed and pelvic lymph nodes were treated. These groups of patients are treated using a sequential boost technique. The CTV of the first phase includes the prostate (or prostate bed) and the pelvic lymph nodes. The prescription dose for this portion of treatment is 46 Gy. The CTV of the second phases is that of the two categories described above but with prescription doses of 32 Gy or 20 Gy.
The training plans used to create the HN models consisted of all types of HN plans treated at our center. The selection of HN plans was limited to those treated using volumetric arc therapy (RapidArc â , Varian Medical Systems, Palo Alto, CA). This was done in order to minimize the number of variables that could affect the behavior of the models; for example, differences in treatment techniques (static field IMRT vs RapidArc). This resulted in a total of 177 plans since the implementation of this technique at our center.
Among these 177 plans, 61 plans were assigned to the validation set (see below) and the reminder 116 plans to the training set. These training plans included a wide range of HN clinical cases; such as nasopharynx, oropharynx, and post-op treatments. Table 1 summarizes the number of HN plans used to create the model in terms of number of targets and dose prescription combinations.
All Prostate plans were treated using RapidArc â with two full arcs. HN treatments were treated using two partial or full arcs depending on whether the disease was unilateral or bilateral respectively.

2.B.2 | Model creation process
The prostate model was created by including prostate plans from all categories (listed in Section 2.B.1) and following a step-by-step process. This process consisted of creating a general prostate model This approach was followed since it was unclear how HN plans should be categorized. Even for treatments of the same site (e.g., oropharynx); there were multiple combinations of dose prescription, changes in the relative geometry of the target with respect to OARs, and number of targets. Using the largest number of plans available, and using scatter plots to identify clusters resulted in a more efficient (and practical) approach.
The models were evaluated using the coefficient of determination and the chi-squared test. In order to recognize potential outliers, the scatter plots (provided by the RapidPlan TM algorithm) were used to detect cases not fitting to the general model behavior. Scatter plots show the relationship between the dependent variable DVH PCS on independent variables and any training case appearing to be exceptionally far away from the regression line was considered as a potential outlier. In addition, potentially influential cases were determined using Cook's distance (the threshold value being roughly 3.0).
Cook's distance was used since it provides a measure of influence of an outlier by omitting the plan from the regression analysis. 20 Being influential means that the single case has a large impact for the regression line and does not necessarily appear as an outlier in the scatter plot. Other metrics provided by RapidPlan TM , such as modified Z scores, studentized residual, etc., were mainly omitted since the same information was available in the scatter plots. The influence of the potential outliers was tested by removing them from the training set and re-training the model. The removal of structures was done T A B L E 1 Details of the plans used to train the HN model. | 217 iteratively by excluding one or two strongest influential cases (or outliers) at a time and monitoring the improvement in the trained model. The removal stopped once no more significant improvement was observed. The chi-squared test was used to monitor over-fitting.

Number
A threshold value of 1.3 was used as an indication of no severe over-fitting.

2.B.3 | Outlier analysis
Outliers are plans which deviate from the general trend in the analysis. Their effect on a model is typically not immediately clear because not all outliers affect the overall trend of the data in the regression analysis.
The first step in addressing identified outlying plans consisted of re-planning them. An alternative approach would consist of removing these plans from the model entirely. However, this does not ensure that the model will work for these patients in the future. Re-planning also helps to reduce the uncertainty (width of the estimate band) in the estimate. Care was taken to ensure that dose reduction to an OAR of interest did not result in an increased dose to another OAR nor reduce target coverage. Treatment plans for which re-planning was able to reduce the dose effectively are dosimetric outliers and the outlying plans were replaced by the new plans. Treatment plans for which re-planning was unable to reduce the dose for a given Over-fitting means that the estimation model only describes the training set but may not generalize well for other cases. Outliers that cause over-fitting are a special type of outliers and occurs when a single plan increases the number of variables needed to predict the DVHs of the training set. Over-fitting was evaluated by inspecting all of the scatter plots for each OAR in conjunction with the goodness of fit and chi-squared metrics. The ideal way to address this type of outliers is to find plans with similar characteristics and add them to the training set. Finding similar plans is challenging (or potentially impossible) and, hence, plans that caused over-fitting were removed from the model.

2.C | Models validation
The validation process consisted of using the trained models to esti- at points corresponding to 50%, 85%, and 99% of OAR maximum dose (which we will refer to a V50, V85, and V99). Yuan et al. 9 quantified this difference with respect to the prescription dose but these metrics would become irrelevant for OARs whose dose is less than 50% of the prescription dose. Instead, the doses used to com-  Table 2 below.
Both models were used to create plans as part of the validation process. The lower boundary of the estimated DVHs were set as optimization objectives with fixed priorities for all validation plans.
These are called line objectives and were, arbitrarily, set with a priority value of 50 for the bladder, rectum, and femoral heads. Two point objectives were set for the planning target volume (PTV) with a priority of 120: a lower V98% = 100% and an upper V102% = 0%.  Two ellipses are also shown to encircle the clusters that make this data bipolar. The plans enclosed by the ellipse on the right of the graph correspond to those whose CTV include the pelvic lymph nodes while all the reminder of the categories are encircled on the ellipse on the left. Forty-three plans were added since a higher number of plans were necessary in order to obtain an evident clus- ter. Figure 1 also shows the line fitted by the model along with the lines corresponding to one and two standard deviations. It can be seen that the line fitted by the model results in a compromise. Given that treatments involving pelvic lymph nodes are substantially different from the other types of prostate plans, these types of plans were excluded from the general prostate model and a model-specific for this category would have to be created. Once pelvic lymph nodes plans were removed, the bladder scatter plot showed that the reminder categories of prostate plans followed the same trend. Table 3 lists the values of the metrics used to evaluate the goodness of statistics for the first and final models. Figure 2 Figure 6 shows the scatter plot that was initially obtained for the parotids (laterality combined as a single training structure, similarly   to the femoral heads in the prostate model). Figure 7 shows the corresponding scatter plot that was obtained after two plans were replanned (dosimetric outliers) and the parotids for three plans were excluded (influential geometric outliers). Further investigations showed that these three outliers, encircled in the upper right corner, corresponded to plans with large overlaps of the parotids with the targets. In fact, similar plans in the validation set showed that these plans resulted in a third cluster (tri-polar data). These three plans were then removed since bi-polar and tri-polar data are better described by separate models. The third group of patients was excluded from the model and, instead, the model was validated including these types of plans to investigate its performance under these circumstances. Figure 7 shows the presence of two clusters corresponding to contralateral parotids (grouped in the lower left corner) and the remainder of the parotids. While the accuracy of the model can be improved by creating separate models for these two groups, this would require the user to decide which model to use.

3.A.2 | HN model
This approach is not possible if the proper model cannot be determined upfront, as was the case here, and so it was decided to merge both parotids into the same model and evaluate the accuracy resulting from this compromise. Note that this approach is different from that followed to create the prostate model, where a separate model would have to be created for high risk patients which can be easily identified.
Merging the bipolar data resulted in a compromise. A direct consequence of this compromise is a larger upper estimate boundary (e.g., see the long tail of the upper estimate in Fig. 8). This effect is because the final DVH estimate is obtained by summing the average DVH (which is calculated from both contralateral parotids and parotids that receive larger doses) with the DVHs reconstructed from the principal components. Table 4 below summarizes the changes that were done to the first HN model along with a description of why they were made.
These changes were done in multiple iterations. Table 4 shows that a total of 20 structures were removed from the model because they were causing over-fitting. The majority of these plans resulted in over-fitting as a result of limitations in the algorithm, such as having doses on the order of the threshold dose used to define DVHs that require in-field regression vs those DVHs that do not (and that can be described by other DVH components, e.g., leaf transmission).
Over-fitting was particularly noticeable for the mandible, where the stepwise regression also returned terms corresponding to the product of PCS (belonging to different and orthogonal principal components). These products of PCS have no physical meaning and lead us to remove more plans in order to avoid them. A total of eight plans were excluded for the mandible and, even after these eight structures were removed from the model, over-fitting was still taking    Figure 9 also shows the variation in DVH shape for this OAR.

3.B | Validation results
Note that merging of the femoral heads into a single model structure (each femoral head was contoured separately) was done given that the treatment technique used at our center is symmetric.
Merging the femoral heads into a single structure changed the mean values of the V50, V85, and V99 by values which were within the uncertainty (standard error of the mean).  Fig. 8 above). The pharyngeal constrictor is the least accurate structure; however, the steps taken to improve the model result increased accuracy for this OAR as shown in Fig. 11.  Fig. 11 and corresponds to a mean dose of 17.6 Gy. Figure 12 compares the parotids estimated and clinical mean doses which are commonly used as a clinical tolerance. The average difference between the estimated and clinical mean doses was À0.8% AE 0.4% with a standard deviation of 3.6%.
Tables 8 and 9 list the average difference between DVHs of plans generated using the models and those from the independent clinical plans. The differences were calculated for the clinical goals most commonly used (listed in the second column). The results of Table 8 show an agreement of 0.3% for the V40 Gy metric of the rectum. This is despite the reduced accuracy of the V50 metric for the estimated DVH. This is because the line objectives are placed on the lower boundary of the estimate which coincides with the clinical DVH in many cases (see Fig. 10). The V50 metric on the other hand, was calculated based on the estimated average.     previous tools. 9 The models in this study were compared by calculating the difference between the volume of the DVH curve in the clinical plan and the volume of the estimated DVH. These volumes were compared at three points along the x axis which corresponded to the 50%, 85% and, 99% of the maximum dose (of the clinical plan). These points were used in a previous publication 9 and provided a benchmark for model comparison. The selected metrics can also be generalized to plans with different prescriptions, where many clinical goals would be irrelevant. Another advantage of the metrics used in this study can be seen in the planning results for the femoral heads (Table 8). Table 8 18 There is, however, no evidence that one metric is better than the other and this topic needs further investigation. The modified V50, V85, and V99 metrics used in this study can be used to evaluate the accuracy of estimated DVHs for all plans irrespective of dose prescription and provide a reasonable way to quantify the shape of the estimated DVH.
The models were also used as optimization objectives as part of the validation process. It should be noted that while some authors have benchmarked their models by using them to guide the optimization process, 18 others have focused entirely on the accuracy of the estimated DVHs, 9,21 and others in both. 8 While the need to re-plan all plans remains debatable, we decided to use our models as optimization objectives for the validation process. This provided an end-to-end test for the models and ensures that the model does not compromise on untrained structures. The models were therefore evaluated in three different ways: (a) by analyzing the accuracy of the estimated DVHs on the training set (summarized in the goodness of fit statistics), (b) by analyzing the accuracy of estimated DVHs on an independent validation set, and (c) by using the estimated DVH to guide the inverse planning process. This therefore resulted in a rigorous model commissioning approach that gives confidence of its performance.
Planning can be done in a variety of ways, potentially leading to different results. Therefore, the planning of this study was done by means of an automated run without user interaction. This approach leads to user independent results and is more practical (due to the large number of validation patients). However, the approach is limited since the presence of untrained OARs could not be accounted for. Our results show that the models can be used by users outside of the institution where they were originally created yet preserve the planning trend. Additional planning could be conducted for cases where differences between the estimated and the original DVHs are significant since it implies potential for dose reduction (cases where the DVH was under-estimated) or increased dose (cases the DVH was over-estimated). The question of what difference is considered significant to require additional planning is still unknown and is beyond the scope of this publication.
The presence of bipolar data was evident in the bladder of the prostate model and in the parotids of the HN model. The simplest way to address bipolar data is to simply separate them as was done in the prostate model. The bipolar data for the parotids is due to ipsi-and contra-lateral parotids and its separation is more challenging as this would require a user to come up with a strategy to accurately identify these groups of parotids. Incorporation of techniques that aid in cluster classification could improve the accuracy of parotid DVH estimates. 22

| CONCLUSION S
The creation of KBP models requires diligence since the presence of outliers can affect the accuracy of the estimated DVHs. A combination of multiple tools should be used to identify and address outliers.
The process followed to create models presented in this publication led to DVH estimates with an accuracy of À2%-4% for the prostate model and from À2% to 8% for the HN model with agreements in parotid mean doses off less than 1% on average. More automated ways to analyze model creation would be desirable and would allow investigators to handle outliers that would have a negative effect on a model.

ACKNOWLEDG MENTS
The authors would like to thank Aldrich Ong and James Butler from CancerCare Manitoba's Radiation Oncology Department for their help with questions regarding contours. We would like to thank Yves Archambault and Helen Philips from Varian Medical Systems for providing the system to test KBP. Finally, we would like to thank Lindsey Olsen and James Kavanaugh, from Washington University at St.
Louis, for their invigorating discussions regarding the creation of KBP models.