Applying pytorch toolkit to plan optimization for circular cone based robotic radiotherapy

Robotic linac is ideally suited to deliver hypo-fractionated radiotherapy due to its compact head and flexible positioning. The non-coplanar treatment space improves the delivery versatility but the complexity also leads to prolonged optimization and treatment time. In this study, we attempted to use the deep learning (pytorch) framework for the plan optimization of circular cone based robotic radiotherapy. The optimization problem was topologized into a simple feedforward neural network, thus the treatment plan optimization was transformed into network training. With this transformation, the pytorch toolkit with high-efficiency automatic differentiation (AD) for gradient calculation was used as the optimization solver. To improve the treatment efficiency, plans with fewer nodes and beams were sought. The least absolute shrinkage and selection operator (lasso) and the group lasso were employed to address the “sparsity” issue. The AD-S (AD sparse) approach was validated on 6 brain and 6 liver cancer cases and the results were compared with the commercial MultiPlan (MLP) system. It was found that the AD-S plans achieved rapid dose fall-off and satisfactory sparing of organs at risk (OARs). Treatment efficiency was improved by the reduction in the number of nodes (28%) and beams (18%), and monitor unit (MU, 24%), respectively. The computational time was shortened to 47.3 s on average. In summary, this first attempt of applying deep learning framework to the robotic radiotherapy plan optimization is promising and has the potential to be used clinically.


Introduction
Robotic radiotherapy (CyberKnife, Accuray Inc., Sunnyvale, CA) enlarges treatment space and improves delivery versatility upon the standard linear accelerator, making it ideal for the delivery of stereotactic radiosurgery (SRS) [1,2] and stereotactic body radiotherapy (SBRT) [3,4]. The compact linac head is mounted on a highly accurate and reliable robotic arm. The robotic arm travels along the path connecting a set of pre-programmed discrete points (referred as node) distributed non-coplanarly. By orientating the linac head at these nodes, any beam pointing to the target could be delivered. The original collimator system included 12 diameter fixed circular cones, which was later upgraded to the variable aperture IRIS collimator [5]. Recently the vendor introduced multi-leaf collimator (MLC) to the system. Although the MLC increased the flexibility for field size and improved the treatment efficiency for irregular shaped targets, the circular cones are still in use widely due to its highly conformal dose distribution [6]. Furthermore, the MLC system has not yet been widely adopted, with only about 20% systems installed globally at the current time according to the vendor. Therefore, any improvement in the circular cone based plan optimization will have a big impact in the clinical operation of these systems, which is the aim of this study.
The non-coplanar treatment space improves the delivery versatility but also leads to prolonged treatment time. A typical treatment plan may use up to 100 beams/nodes and 2-3 size cones to get a conformal dose distribution, and the treatment time may take 30 min to 1 h, depending on the tumor sites [7]. The most effective approach to shorten treatment time is to use fewer beams, nodes and cones in the plan. This is because the beam positioning (including the switch of cones, the reorientation within each node and the robotic movement between the nodes) time, rather than the beam-on time takes the major part of the treatment duration. The need to improve the treatment efficiency means to reduce the number of beams and nodes without jeopardizing the quality of plan. Previously, we presented a singular value decomposition accelerated linear programming (SVDLP) optimization method [8]. This method used the least absolute shrinkage and selection operator (lasso) [9] for beam reduction. The number of nodes may also be reduced with the decrease of beams, but the issue was not fully addressed, at least not directly. Naturally, the beams delivered at the same node belong to the same "group". The aim is to search for a sparse solution, for both the individual (beam) and group (node) aspects. In this study, we adopted both the lasso and the group lasso [10] regularization terms to address the "sparsity" issue.
Gradient based optimization approaches methods were commonly used for planning optimization [11,12]. The computation of derivatives is often the most timeconsuming step. Based on the chain rule of differential calculus, automatic differentiation (AD) replaces the domain of variables with the computation graph, and had been used for intensity-modulated radiation therapy (IMRT) planning optimization [13] as early as 2006. But back then, the limited packages and supported programming language hindered the application [14]. Recently AD received increasing attention with the emergence of deep learning (DL), as it had been used for the training of large-scale network. The recent development of DL also provided several well-established framework such as tensorflow [15], caffe [16] and pytorch [17], which saved the trouble of "reinventing the wheel" for regular investigators. In this study, we adopted pytorch framework for the circular cone based planning optimization, and explained how to use the well-validated DL framework for inverse planning.
The presented optimization approach used AD to compute derivatives, and the lasso and the group lasso to address the "sparsity" issue. Therefore it is later referred as AD-S approach for simplicity. Twelve (12) clinical cases of different complexities (brain and liver) were used to validate the AD-S approach. Currently, the commercial treatment planning system (TPS, MultiPlan) implements a sequential optimizer (SO) [18,19]. In November 2018, the vendor released a new TPS (Precision v2.0), and upgraded the optimizer to VOLO. VOLO implemented graphical processing unit (GPU)-based dose calculation, replaced the sequential optimization with a single cost function integrating all clinical goals (widely used in the field of IMRT/VMAT optimization), and pruned the beams with lower monitor unit (MU) during optimization. As far as the authors know no centers or institutions in China has the Precision 2.0 installed currently. Therefore a direct comparison with Precision is not possible, and indirect comparison based on published data were performed and discussed.

Methods and materials
In this section, the AD-S approach and optimization experiments were presented. First the procedure of beam initialization was introduced briefly. Then the optimization model and the lasso and group lasso regulation terms were described in details. The topology and transformation of the optimization model was presented in the following sub-section. Finally the patient data implementation details were described.

Beam initialization
Generally, two methods are used for beam initialization. The first one is to start with a sub-set of randomly initialized candidate beams, then eliminate the beams with lower weights and add new candidate beams after each iteration of the optimization. The SO of Multiplan system used this "drop-and-pick" mechanism. Due to the intrinsic heuristic nature, it is inefficient to cover the complete treatment space. The other method is to initialize candidate beams to cover the entire beam space, and use all initialized beams for optimization simultaneously. This method takes full advantage of the treatment space but also means greater computational cost. The second scenario was adopted in our previous work [8] with a novel technique for beam initialization. The mechanism simulates the process of moving the equivalent dose taper to cover the entire target volume. The candidate beams were initialized with the center at the edge of adjacent beams. Both dosimetric analysis and patient data-based experiments proved that the mechanism could thoroughly cover the beam space without introducing redundancy. The same mechanism was adopted in this study.

Optimization model
With the initialized candidate beams, the dose of each voxel (D i ) is expressed as: m and n is the number of voxels and candidate beams, respectively. d ij is the dose delivered to ith voxel by the jth beam, w j is the weight of the jth beam. W is the weight vector. For each optimization objective, the cost function is of the general format: N is the number of voxels of certain region of interest (ROI), g is the condition function varying with the objectives. D 0 i is the prescription dose for planning target volume (PTV) or the tolerance dose for organs at risk (OARs). Table 1 lists the condition functions of some typical objectives.
The total objective function is the summation of all objectives and the lasso and group lasso regularization terms: where α and λ are the weight of objective function and regularization terms, respectively. | | and 2 represents the operation of L1 and L2 norm, respectively. W l is the weight vector composed by the weights of beams belonging to the lth node.

Topologizing optimization model and training practice
The total dose (D i , i ∈ (1,2,…, m)) of each ROI voxel is contributed by all beams (d ij , j ∈ (1,2,…,n)), which is calculated as the weighted (w j ) linear combination of d ij (Eq. 1). This operation can be topologized into a simple feed-forward neural network, as shown in Fig. 1. This network is of the simplest architecture: it contains no hidden layers, and only the basic input and output layers. The neuron weight represents the beam weight. In this way, the optimization of beam weight is transformed into the tuning of neuron weight, i.e. the treatment planning optimization is converted into "network training". As listed in Table 2, the "training" is different from typical DL network: (1) For typical DL, the network was trained and tested on training and testing datasets, respectively. For treatment planning, the network was trained for each single patient. Once the training was completed, the optimization was also finished. No testing process was required. (2) For DL network training, only a small subset of training dataset was used (minibatch strategy) for each iteration. But for treatment planning network, the dose delivered to all voxels by all beamlets (d ij , i ∈ {1, 2, …, m}, j ∈ {1, 2, …, n}) were feed

Table 1 Condition functions of typical objectives
H is the Heaviside function. For dose volume (DV) objectives, head and tail refer to the voxels at the top and end position sorted by dose in descending order into the network to calculate the loss function for each iteration.
(3) For DL, the desired results were the whole network. In other words, the goal was to obtain correct results via the network; the exact weight of each neuron was not the concern. But for this work, the weight of each neuron was the weight of corresponding beam, which were the desired results. With this transformation, the well-established framework for DL could be tapped to solve the treatment plan optimization problem. The regularization terms are designed to filter out the beams of very limited contribution to the plan. In practice, it was found that the weights of these beams were reduced to a very small value but not exactly zero. In order to reduce the computational cost, these beams are removed during optimization, i.e. the network is dynamically adjusted by trimming the neurons with marginal weights. The threshold of trimming beam weight was set to 0.01 MU. In addition, an early stopping criterion was adopted: the iteration is terminated when the number of beams does not decrease within certain consecutive iterations. The number of consecutive iterations is set to 10.

Materials: patient data and machine parameters
The Cyberknife system provides two sets of pre-defined nodes for head and body treatment. The treatment position determined which set of nodes to use. The AD-S approach was validated on two types of cases: 6 brain and 6 liver cancer cases. For the brain cases: patient 1 was lymphoma case, patient 3 and 5 were glioma cases with large PTV, and the rest were metastases. For each case, the AD-S plan used the same set of nodes as MLP plan. Table 3 listed the patient data, including the prescription dose (D P ), fraction dose, number of fractions, volume of PTV (V PTV ), number of voxels of all ROIs, the sizes of used cones, the number of available nodes and initialized beams. Because the computational time of AD-S approach is directly related to the planning parameters, it was also listed. The spacing of optimization dose grid was 2 mm × 2 mm × 2 mm, and the ROI was sampled to 2 mm, correspondingly. The MLP plans were designed and approved by experienced physicists and physicians. The AD-S plans were optimized using MLP plans as reference. The optimization terminated if the obtained plans were no inferior to MLP plans. The size of cones were selected by experienced planner according to the size and  shape of PTV. The AD-S approach used the same set of cones with the MLP plans. The beam initialization was implemented in C++, and dose was calculated using ray-tracing method on GPU. Plans evaluation was performed on an in-house platform, which was developed and validated in our previous study [8,20]. The MLP plans were also imported to the platform, and all the following mentioned plan evaluation indices were calculated under the same resolution (2 mm × 2 mm × 2 mm) with AD-S plans. Please note: since the resolution was different, it may result in slightly changes in these indices. The optimization was implemented with pytorch (v0.4.0, Paszke et al. [17]). The critical codes could be found in the suppelmental material. The example of using pytorch for treatment planning optimization was illustrated with a simple case in Additional file 1. Among the available optimizers in pytorch, it was found that the L-BFGS is the fastest algorithm with satisfactory robustness. Therefore the L-BFGS optimizer was selected in this study. Note that: because the L-BFGS includes certain logical operations and the network is simple and shallow, it executes faster on CPU than on GPU. So the optimization was performed on CPU in this study. All computations were performed on a desktop computer with Intel Xeon E5-2620 processor and NVIDA GeForce GTX 1080ti GPU.

Results
In this section, the detailed result of one sample liver case, including the objective value and the numbers of nodes and beams during optimization, was first presented. And then the evaluation and comparison of the treatment plans generated by the AD-S approach (later referred as AD-S plans) and MLP system (later referred as MLP plans) on the cohort of patients were presented. Figure 2 showed the objective value (F in Eq. 4) and the numbers of nodes and beams as functions of iterations of the first liver case. The objective dropped dramatically at the first iteration, but much slower for the subsequent iterations. The number of beams decreased rapidly at the first several iterations, and continued to decrease, and remained the same for the last 10 iterations due to the early stop strategy. The number of nodes decreased gradually through the whole optimization. The beam and node reduction does not increase the objective value, indicating the number of beams and nodes was reduced without jeopardizing the plan quality. Table 4 compares the evaluation indices between AD-S and MLP plans. For the brain cases, the AD-S plans reduced the number of nodes, beams and MU by 26%, 17.5% and 26% on average. For liver cases, these indices were reduced by 29%, 18% and 22.2%. And for all cases, the reduce ratio was 27%, 18% and 24%. The conformity index (CI) is defined as the ratio of the volume covered by D p over the volume of PTV (V PTV ). For the brain, liver and all cases, CI was improved by 18%, 19% and 19%. The homogeneity index (HI) is defined as the ratio of the maximum dose (D max ) within PTV over D p . On average, the HI of AD-S plans for brain cases was slightly lower than MLP plans but slightly higher for liver cases. Paired t-test was performed on the indices of all AD-S plans against MLP plans. The results indicated that the reduction on the number of nodes and beams, MU and CI was significant. The change of HI was not significant. We would like to point out that: only a soft constraint was imposed on D max , and no certain height (like 150% of D p ) was compulsory for either MLP or AD-S plans. We only limited the maximum dose of PTV at 110% of prescription dose with a relative low weight. The average computational time of AD-S approach was 39.6 s (23.0-56.1 s) for brain cases, 38.8 s (20.8-56.7 s) for liver cases and 39.2 s (27.5-56.7 s) for all cases. Figure 3 showed dose gradient comparison of AD-S and MLP plans. For each case, the volume covered by 50-100% of D P (V 50 -V 100 ) was normalized to corresponding V PTV . As shown in Fig. 3, the AD-S curve was under the MLP curve for each type of case, indicating that the AD-S plans achieved higher/better dose gradient around PTV, i.e., more rapid dose fall-off.

Evaluation and comparison
Tables 5 and 6 listed the maximum and average dose of the OARs of the brain and liver cases, respectively. Table 5 also listed the V 5 , V 10 and V 20 of the liver organ. The dose of AD-S plan greater (worse) than the MLP plan was denoted in bold. For the brain cases, the dose of  AD-S plans was greater for 8 out of 60 items. Out of the 8 items, 5 had deviations less than 30 cGy, and the maximum deviation was 166 cGy. For the liver cases, the dose of AD-S plans was greater for 14 out of 96 items. Out of the 14 items, 11 had deviations less than 70 cGy, and the maximum deviation was 191.9 cGy. These demonstrated that the AD-S plans had better overall OAR sparing. Figures 4 and 5 showed the dose volume histograms (DVHs) and the dose distribution of the first brain and liver cases, respectively. For the brain case, the minimum dose with PTV of MLP plan was significantly lower, corresponding with the "round-ended" the DVH curve. This may be caused by the node or beam reduction procedure of MLP, which was to remove the nodes or beams with lower weight after weight optimization. For the ADS approach, the beam and node reduction was implemented during optimization, which would not jeopardize the plan quality. As shown in Fig. 4, the dose of the rings  outside PTV of the ADS plans was lower than the MLP plans, indicating more rapid dose fall-off, which could also be observed on the dose distribution comparison of Fig. 5.

Discussion
In this study, we topologized the optimization problem into a simple feed-forward neural network, thus converted the treatment plan optimization into network training, and furthermore utilized the automatic differentiation approach for network training. In this way, the well-established pytorch DL framework was used. Our results demonstrated that the framework could handle both the brain and liver cases with different complexities.
In addition, the optimization speed was also satisfactory. As far as we know, this was the first attempt to successfully adapt machine learning toolkit to the cone-based treatment plan optimization. Using pytorch for inverse planning is naturally different from the typical application of DL. In standard DL, the weight of single neuron has no clear meaning, and the whole network acts like a "black box". For this application, the weight of each neuron represents the weight of each candidate beam. The other difference is the practice of "network training". For DL, over-fitting may be the most challenging issue, meaning that the network may have excellent performance on the training dataset, but poor performance on testing dataset. The mini-batch training strategy is used to avoid over-fitting, which is to randomly select very few samples for training at each iteration. In this study, the beam weights were optimized separately for each plan. This means the network was trained from scratch, and all voxels were training dataset, thereby avoiding the issue of over-fitting. Therefore the full-batch training strategy was used in our strategy. For results comparison, since the authors do not have access to the latest VOLO, we compared AD-S with the MLP system. The reduction rate of the AD-S approach over MLP was 27.5% (MU), 17.6% (node) and 24% (beam), which was comparable with VOLO according to published studies. Zeverino et al. reported the reduction rate of VOLO over MLP was 36% (MU), 14% (node) and 31% (beam) [21]. Schüler et al. found that the MU and beam reduction rate was 21.8% and 22.0% for 6D skull tracking plans, and 28.1% and 28.4% for Xsight spine tracking plans [22]. Giżyńska et al. reported that the reduction rate was 48.7%/32.8% (MU), 13.4%/7.9% (node) and 26.5%/7.9% (beam) for prostate/lung plans [23]. It is worth to point out that different approaches in optimization algorithms were used in VOLO and our approach. We also admitted that the AD-S system was compared with MLP using clinical MLP plans. More comprehensive comparison like re-optimize the MLP plans to obtain better plans, and meanwhile re-optimize the AD-S plans using these improved plans as reference could make the study more solid.
To our knowledge, it will take about rather long time for MLP system to obtain a clinical practical plan. The optimization time may be several hours for large irregular target. We want to point out that the circular cone and MLP system are not well-suited for large irregular target. The recently introduced MLC together with the VOLO optimizer would surely shorten the optimization time and improved the quality of treatment plan. Applying the framework of this work for more widely used MLC treatment planning, either CyberKnife system or linac system, will be our next work in the future.
Recently the vendor introduced multi-leaf collimator (MLC) to the system. Although the MLC increased the flexibility for field size and improved the treatment efficiency for irregular shaped targets, the circular cones are still in use widely due to its highly conformal dose distribution [6]. Furthermore, the MLC system has not yet been widely adopted, with only about 20% systems installed globally at the current time according to the vendor. Therefore, any improvement in the circular cone based plan optimization will have a big impact in the clinical operation of these systems, which is the aim of this study.
In our previous SVDLP algorithm we developed a linear model and solved the model with Gurobi (Gurobi Optimization, Inc., Houston, TX, USA). The improvements of AD-S over SVDLP include: (1) Both beam and node reduction was integrated into the model; (2) The optimization model was free from the limitation of linearity, and the dose-volume constraints could be directly added; 3. Open-source toolkit replaced the commercial solver. For the same brain (patient 3) and liver (patient 3) cases of our previous study, the AD-S plans achieved comparable dose distribution but used fewer nodes and beams than SVDLP plans. In addition, the computational time was 627 s and 285 s for the SVDLP approach, and 84.8 s and 32. 1 s for AD-S approach, almost an order of magnitude improvement.
Our results demonstrated that the AD-S approach used fewer nodes and beams and lower MU to achieve same or better dose distribution than MLP. In other words, the treatment time was shortened without Fig. 4 DVHs of the first liver and brain cases. The AD-S plans were plotted in solid lines and MLP plans in dash lines jeopardizing the treatment quality. This was because the penalty on the numbers of beams and nodes was modeled with the lasso and group lasso terms and took effect during optimization. The MLP system used an alternative method, which is to directly eliminate the beams and nodes with marginal weights after optimization, i.e., the optimization and reduction procedures were performed separately. It was plausible that the reduction procedure would deteriorate the optimization results significantly. So was the case of the roundended PTV DVH curve (lower minimum dose) of the brain case, which was not clinically desirable, especially for SRS treatment. The VOLO optimizer has also overcome this limitation by integrating beam reduction with optimization.
The AD-S approach used the same sets of cones with MLP system, which was selected empirically by experienced physicist. How to choose the optimal cones were not covered in this study. Intuitionally, the optimal cone size is directly related with the size and geometric irregularity of the PTV. Considerable researches have been published [24][25][26], which showed that the quality of treatment plan would be improved even with simple empirical formula or mechanism. However, this issue has not been fully addressed, which could be explored potentially with the DL technique. The fast AD-S approach proposed makes it possible to test more cases with varying complexities and fulfill the task of big-dataset construction for further DL research.

Conclusion
We have developed an AD-S approach for circular collimator based robotic radiotherapy treatment plan optimization. The optimization model was topologized into a simple neural network, and the automatic differentiation approach was adopted to solve the model. The lasso and group lasso regularization terms were utilized for beam and node reduction. Our investigation in brain and liver cases demonstrated that the AD-S approach could quickly achieve at least comparable treatment plans using fewer beams, nodes and lower MU.

Additional file 1.
The key codes of using pytorch for optimization.