A reinforcement learning application of guided Monte Carlo Tree Search algorithm for beam orientation selection in radiation therapy

Due to the large combinatorial problem, current beam orientation optimization algorithms for radiotherapy, such as column generation (CG), are typically heuristic or greedy in nature, leading to suboptimal solutions. We propose a reinforcement learning strategy using Monte Carlo Tree Search capable of finding a superior beam orientation set and in less time than CG.We utilized a reinforcement learning structure involving a supervised learning network to guide Monte Carlo tree search (GTS) to explore the decision space of beam orientation selection problem. We have previously trained a deep neural network (DNN) that takes in the patient anatomy, organ weights, and current beams, and then approximates beam fitness values, indicating the next best beam to add. This DNN is used to probabilistically guide the traversal of the branches of the Monte Carlo decision tree to add a new beam to the plan. To test the feasibility of the algorithm, we solved for 5-beam plans, using 13 test prostate cancer patients, different from the 57 training and validation patients originally trained the DNN. To show the strength of GTS to other search methods, performances of three other search methods including a guided search, uniform tree search and random search algorithms are also provided. On average GTS outperforms all other methods, it find a solution better than CG in 237 seconds on average, compared to CG which takes 360 seconds, and outperforms all other methods in finding a solution with lower objective function value in less than 1000 seconds. Using our guided tree search (GTS) method we were able to maintain a similar planning target volume (PTV) coverage within 1% error, and reduce the organ at risk (OAR) mean dose for body, rectum, left and right femoral heads, but a slight increase of 1% in bladder mean dose.


I. Introduction
Radiation therapy is one of the main modalities to cure cancer, and is used in over half of cancer treatments, either standalone or in conjunction with another modality, such as surgery or chemotherapy.For intensity-modulated radiation therapy (IMRT), the patient body is irradiated from fixed beam locations around patient body, and the radiation field is modulated at each beam position using multi-leaf collimators (MLC).In IMRT, the optimal choice of beam orientations has a direct impact on the treatment plan quality, influencing the final treatment outcome, hence patient quality of life.Current clinical protocols either have the beam orientations selected by protocol or manually by the treatment planner.Beam orientation optimization (BOO) methods solve for a suitable set of beam angles by solving an objective function to a local minimum.BOO has been studied extensively in radiation therapy procedures, for both coplanar 1,2,4,6,7,8,10,11,12,14,15,22,23,24,25,26,29,33,34,35,38,39,42,48 , and noncoplanar 4,5,8,15,26,27,29,32,33,36,45,46,47,49 IMRT, or intensity-modulated proton therapy 17,18,30,43 (IMPT) by researchers in the past three decades.However, BOO has not been widely adopted due to their high computational cost and complexity, since it is a large-scale NP hard combinatorial problem 3,49 .
Despite the extensive research, the lack of practical clinically beam orientation selection algorithms still exists due to the computational and time intensive procedure, as well as the sub-optimality of the final solution, and BOO remains a challenging step of the treatment planning process.
To measure the quality of the BOO solution, it is necessary to calculate dose influence matrices of each potential beam orientation.Dose influence matrices for one beam associates all the individual beamlets in the fluence map with the voxels of the patient body.This calculation is time consuming and requires a large amount of memory to use in optimization.To mange the limited capacity of computational resources, the treatment planning process, after defining the objective function, is divided into two major steps: 1) find a suitable set of beam orientations, and 2) solve the fluence map optimization problem (FMO) 10 of those selected beams.However, these two steps are not independent of each other-the quality of BOO solution can be evaluated only after FMO is solved, and FMO can be defined only after BOO is solved.Due to the non-convexity and large scale of the problem, researchers consider dynamic programming methods by breaking the problem into a sequence of smaller problems.One of the successful algorithms specially for solving complex problems such as BOO is a method known as Column Generation (CG).In the original application of CG into radiotherapy, Romeijn et al. 37 solved a direct aperture optimization (DAO) problem by using CG.Dong et al. 16 then proposed a greedy algorithm based on column generation, which iteratively adds beam orientations until the desired number of beams are reached.Rwigema et al. 40 use CG to find a set of 30 beam orientations to be used in 4π treatment planning of stereotactic body radiation therapy (SBRT) for patients with recurrent, locally advanced, or metastatic headand-neck cancers, to show the superiority of 4π treatment plans to those created by volumetric modulated arc therapy (VMAT).Nguyen et al. 28 used CG to solve the triplet beam orientation selection problem specific to MRI guided Co-60 radiotherapy.Yu et al. 47 used an in-house CG algorithm to solve an integrated problem of beam orientation and fluence map optimization.
However, CG is a greedy algorithm that has no optimality guarantee, and typically yields a sub-optimal problem.In addition, CG still takes as much as 10 minutes to suggest a 5 beam plan for prostate IMRT.The aim of this work is to find a method to explore a larger area of the decision space of BOO in order to find higher quality solutions than that of CG in a short amount of time.The proposed method starts with a deep neural network that has been trained using CG as a supervisor.This network can mimic the behavior of CG by directly learning CG's fitness evaluations of the beam orientations in a supervised learning manner.The efficiency of this supervised network, which can propose a set of beam angles that are non-inferior to that of CG, within less than two seconds, is presented in our previous work 41 .Given a set of already selected beams, this network will predict the fitness value of each beam, which is how much the beam will improve the objective function when added in the next iteration.
In this study, we extend our previous work, and combine this trained supervised learning (SL) network with a reinforcement learning method, called Monte Carlo tree search.We use these fitness values from the SL network as a guidance to efficiently navigate action space of the reinforcement learning tree.Specifically, it provides the probability of selecting a beam in the search space of the tree at each iteration, so that the beam with the better likelihood to improve the objective function has the higher chance of being selected at each step.To evaluate our proposed method, we compare its performance against the state-of-the-art CG.We developed three additional combinations of the guided and random search tree approaches for comparison.

II. Methods
The proposed method has a reinforcement learning structure involving a supervised learning network to guide Monte Carlo tree search to explore the beam orientation selection decision space.This method, guided Monte Carlo tree search (GTS), consists of two main phases: 1) Supervised training a deep neural network (DNN) to predict the probability distribution of adding the next beam, based on patient anatomy, and 2), using this network for a guided Monte Carlo tree search method to explore a larger decision space more efficiently to find better solutions.For the first phase we use the CG implementation for BOO problem, where CG iteratively solves a sequence of Fluence Map Optimization (FMO) problems 10 by using GPU-based Chambolle-Pock algorithm 13 , the results of the CG method are used to trained a supervised neural network.For the second phase, which is the main focus of this work, we present a Monte Carlo Tree Search algorithm, using the trained DNN.Each of these phases are presented in the following sections.

II.A. Supervised Learning of the Deep Neural Network
We develop a deep neural network (DNN) model that learns from column generation how to find fitness values for each beam based on the anatomical features of a patient and a set of structure weights for the planning target volume (PTV) and organs-at-risk (OAR).The CG greedy algorithm starts with an empty set of selected beams, calculates the fitness values of each beam based on the optimality condition of the objective function shown in Equation 1.
where w s is the weight for structure s, which are pseudo randomly generated between zero and one during the training process to generate many different scenarios.The value, p, is the prescription dose for each structure, which is assigned 1 for the PTV and 0 for OARs.At each iteration of CG, fitness values are calculated based on KarushKuhnTucker (KKT) conditions 20,21 of a master problem, and they represent how much improvement each beam can make in the objective function value.The beam with the highest fitness value is selected to be added to the selected beam set, S.Then, FMO for the selected beams is performed, which affects the fitness value calculations for the next iteration.The process is repeated until the desired number of beams are selected.The supervised DNN learns to mimic this behavior through the training of the DNN is shown in figure 1.Once trained, this DNN is capable of efficiently providing a suitable set of beam angles in less than 2 seconds, as opposed to the 360 seconds required to solve the same problem using CG.The details of the DNN structure and its training process is described in our previous work 41 .
Patient anatomical features include the contoured structures (organs at risk) of the images from patients with prostate cancer and the treatment planning weights assigned to each structure.

II.B. Monte Carlo Tree Search
The pre-trained DNN probabilistically guides the traversal of the branches on the Monte Carlo decision tree to add a new beam to the plan.Each branch of the tree starts from root as an empty set of beams, and continues until it reaches the terminal state.After the exploration of each complete plan (selection of 5 beams in our case), the fluence map optimization problem is solved and and based on that, the probability distribution to select next beam will be updated, using the reward function, in the backpropagation stage.Then, starting from root the exploration of the next plan will begin until the stopping criteria is met. Figure 2 shows an example of a tree search, which has discovered seven plans so far.

II.B.1. Basics of Monte Carlo Tree Search
Monte Carlo Tree Search (MCTS) uses the decision tree to explore the decision space, by randomly sampling from it 9 .The search process of MCTS consists of four steps: 1) node selection, 2) expansion, 3) simulation, and 4) back-propagation on the simulation result.To explain these processes in detail, there are some properties that need to be defined first, these definitions are as follows: State of the problem: include patients anatomical features and a set of selected beam orientations (B).At the beginning of the planning, this set has no member, and it is updated throughout the solution procedure.
Actions: the selection of the next beam orientation to be added to set B, given the state of the problem.
Solution or terminal state: state of the problem in which the number of selected beam orientations (size of B) is the same as a predefined number of beams (N ), chosen by user.At this point, a feasible solution for the problem is generated.
Decision Tree: The solution space of a set of discrete numbers-in this work discrete numbers are the beam orientations-specially with iterative structures, can be defined as a tree, where each node and branch represent the selection of one number or a subset of available numbers, respectively.
Node (Y ): selection of one potential beam orientations is a node.
Root (O): a node with empty set of beam orientations, every solution start from the root.
Path: a unique connected sequence of nodes in a decision tree.

II.B. Monte Carlo Tree Search
Branch (Q): a path originated from Root node.Each branch represents the iterative structure of the proposed method.The length of a branch is the number of nodes in that branch.In this work solution is a branch with size N + 1.There is only one branch from each root to any node in a tree.
Leaf: last node of a branch.There is no exploration of the tree after a leaf is discovered.
Internal node: any node in a tree except for root and leaves.
The selection process in the proposed method is guided by a pre-trained DNN as described in subsection II.A..This DNN is used to probabilistically guide the traversal of the branches on the Monte Carlo decision tree to add a new beam to the plan.At each node-starting by root note-the DNN is called to predict an array of fitness values for each beam orientation(P ).An element of this array P [i] represents the likelihood of the selection of the i th beam orientation.For example, if the number of potential beam orientations is 180, with 2 • separation, Y would be an array of size 180, and P [2] is the likelihood of selecting beam orientation in 2 nd position of the potential beam orientations, P [2] = 4 • .The expansion process happens after selection process at internal nodes, to further explore the tree and create children nodes.The traversal approach in the proposed method is depth first, which means that the branch of a node, that is visited or created for the first time, continues expanding until there are N + 1 nodes in a branch.In this case, selection and expansion processes are overlapping because only one child node is created or visited at a time, although a node can be visited multiple times and several children can be generated from one node, except for leaf.The leaf node does not have any children.Nodes in a branch must be unique, it means that a branch of each external node (Q) can be expanded only to nodes that are not already in the branch.In fact, beam orientation optimization problem can be defined as a dynamic programming problem with the following formula: where S is a set of indices for previously selected beams, k is index of currently selected beam and is a subset of beams to be selected that has the highest reward value.Each simulation consists of iteratively selecting a predefined number of beams (N ), in this work N = 5.After the exploration of each complete plan, the fluence map optimization problem is solved and used for the back-propagation step, which is used to update the probability distribution for beam selection.

II.B. Monte Carlo Tree Search
Guided Monte-Carlo Tree Search for BOO, April 15, 2020 page 7

II.B.2. Main Algorithm
The detailed of the guided Monte Carlo tree search algorithm in the form of a pseudo code is provided in Algorithm 1. Several properties of each node in the proposed tree are being updated after the exploration of final states.To simplify the algorithm, these properties are addressed as variables and the following is a list of them: After a leaf is discovered, an FMO problem associated with the beams of that branch will be solved, the value of the FMO cost function is the value associated with its corresponding leaf.The cost value of all other nodes (other than leaves) in a tree is the average cost of its sub-branches.For example in Figure 2  set selected beam as B ← ∅ an empty set, best cost value as infinity (V * ← ∞), and best selected beam as create a root node object (O) with the following properties: given the set B as input to DNN, predict an array of fitness values and assign it to root node Y # P ← P rd(DNN, B) predict an array of fitness values F (DNN, B) assign predicted values to new node (YP ← P rd(DNN, B)) 19: end if procedure Reward Calculations

29:
solve F M O given set B and save as Y # V ← F mo(B)

II.C. Algorithms for performance comparison
In general, four frameworks were designed to show the efficiency of the proposed GTS method compared to others.These models are defined as: Guided Tree Search (GTS) : As presented in Algorithm 1, used a pre-trained policy network to guild a Monte-Carlo decision tree.
Guided Search (GuidS) : Used the pre-trained network to search the decision space by iteratively choosing one beam based on the predicted probabilities from the policy network.Unlike GTS, the search policy is not updated as the search progresses.This process is detailed in Algorithm 2.
Randomly sample Tree Search (RTS) : This algorithm is simple Monte-Carlo tree search method which starts with a uniform distribution to select beam orientations (randomly select them), and then update the search policy as the tree search progresses.Note that all of tree operations used in GTS is also used in this algorithm, except for having a policy network to guide the tree.This method is proposed to show the impact of using DNN to guild the decision tree.
Random Search (RandS) : This method searches the decision space with uniformly random probability until stopping criteria is met.It randomly selects 5 beam orientations and solves the corresponding FMO problem.The search policy is not updated.Its procedure is close to Algorithm 2 where the "Select B using DNN" procedure is replaced by randomly selecting 5 unique beams.

Algorithm 2 Guided Search algorithm(GuidS)
Input: Pre-trained DNN 1: initialize B as an empty array, best cost value as infinity (V * ← ∞), and best selected beam as B * ← ∅ 2: set current number of beam orientations in B as 0, NB ← 0 3: set stop ← F alse 4: while stop = F alse do if stopping criteria is met then The DNN trained over 400 epochs, each with 2500 steps and batch size of one.
The performances of four methods GTS, GuidS, RTS, and RandS, explained in section subsection II.C., are evaluated.Two of these methods, GTS and GuidS, use the pre-trained DNN as a guidance network.We originally had the images of 70 patients with prostate cancer, and used the images of the 57 of them to train and validate DNN and therefore cannot be used for the

II.D. Data
Guided Monte-Carlo Tree Search for BOO, April 15, 2020 page 11 testing in this project1 .There are 13 patients that DNN has never seen before and the images of those patients are used in this project as test set.Multiple scenarios can be generated for each patient, based on the weights assigned to patient's structures for planning their treatments.We semi-randomly generated 10 sets of weights for each patients.In total, we have a total of 130 test plans among the 13 test patients for the comparison.All the tests in this paper were performed on a computer with an Intel Core I7 processor@3.6GHz, 64 GB memory, and an NVIDIA GeForce GTX 1080 Ti GPU with 11 GB video memory.
The structure weight selection scheme is outlined by the following process: 1.In 50% of the times, a uniform distribution in the range of 0 to 0.1 is used to generate a weight for each OAR separately.
2. In 10% of the times, the smaller range of 0 to 0.05 is used to select weights for OARs separately, with uniform distribution.The weights range from 0 to 1.This weighting scheme was found to give a clinically reasonable dose, however, the dose itself may not be approved by the physician for that patient.
Finally, considering only test scenarios, FMO solutions of beam sets generated by CG and by the 4 tree search methods were compared with the following metrics: PTV D 98 , PTV D 99 : The dose that 98% and 99%, respectively, of the PTV received PTV D max : Maximum dose received by PTV, the value of D 2 is considered for this metric PTV Homogeneity: P T V D 2 −P T V D 98 P T V D 50 where P T V D 2 and D 50 are the dose received by 2% and 50%, respectively, of PTV Paddick Conformity Index (CI P addick ) 31,44 : (V P T V ∩V 100%Iso ) 2 V P T V ×V 100%Iso where V P T V is the volumne of the PTV and V 100%Iso is the volume of the isodose region that received 100% of the dose High Dose Spillage (R 50 ): V 50%Iso V P T V where V 50%Iso is the volume of the isodose region that received 50% of the dose (a) (b) Figure 3: The rate at which each method successfully found a solution with lower objective function value than that of CG solution.Each attempt is limited to 1000 seconds.3a The percentages of test cases that each method found a solution better than CG solution in at least 1 of their 5 attempts.3b The percentage of test cases that each method found a solution better than CG solution, averaged over all 5 of their attempts to solve the problem.

III. Results
At each attempt to solve a test scenario, each method is given 1000 seconds to search the solution space.Whenever a method finds a solution better than that of CG, the solution and its corresponding time stamp and the number of total solutions visited by this method are saved.We use these values to analyze the performance of each method.The best solution that is found in each attempt to solve the problem is used as the final solution of that attempt and is used for PTV metrics calculation.The average objective function value of final solutions in five attempts are used for the comparison of the performance of four methods with CG solution.At first we compare the efficiency of the four methods of GTS, GuidS, RTS and RandS.Although the main purpose of these methods are to find a solution better than CG, there were some cases that none of these method could beat CG solution, either CG solution was very close to optimal, or there were several local optimums with wide search space which makes it very difficult to explore it efficiently, this is especially true for RTS and RandS methods.The percentages of total number of attempts that each method could successfully find a solution better than CG in at least one of five attempts and averaged over all attempts to solve the problem are presented in Figure 3a and 3b, respectively.
Note that for this test the stopping criteria was 1000 seconds of computational time.As we expected, GTS and GuidS that are using the pre-trained DNN performed better than the other two methods.However, there are still cases that they are not able to find a better solution than CG.

GTS
GuidS RandS UTS Methods If a method finds a solution better than CG, Distance measure will be positive, and for cases that a method was not successful to find a solution better than CG solution, this value will be negative.
It means a method with largest Distance measure found solutions with better qualities-with an objective value smaller than that of CG-and therefore more efficient in the limited time of 1000 seconds.Figure 4 shows the box plot of Distance measure for GTS, GuidS, RTS, and RandS methods.Based on this figure, the average Distance measure for GTS method is 2.48, which is the highest compared to 1.79 for GuidS, 0.67 for RTS, and 0.81 for RandS.
For the computational time evaluation and comparison methods, only the 102 out of 130 test cases, where all four methods had successfully found a better solution than CG within the 1000  The computational time comparison between GTS, GuidS, RandS and RTS.The first time that each method beats a column generation solution.For this measure only successful scenarios are considered.Mean, standard deviation(at the top of each box as mean pm standard deviation) and median (in the middle of the box) are calculated for each method.

GTS
seconds time limit, were considered.The box-plot of the best time needed to beat CG solution is presented in Figure 5a.It represents the first time that a method finds a solution better than CG solution within 1000 seconds among all five attempts.The box-plot of the average time to beat CG among all attempts to solve each test-cases is provided in Figure 5b.To measure the average time to beat CG, all test-cases and all attempts are considered.The fastest method considering this measure is GTS, the average time to beat CG for GTS method is 195 seconds in best case and 237 seconds in total cases.The second fast method is GuidS with the average time of 227 seconds for best and 268 seconds for total cases.RandS outperforms RTS with best time of 337 seconds compared to 364 seconds of RTS.Interestingly, the average total time of RTS and RandS is less than the average of best cases.
To study the statistical significant of GTS method compared to other four methods (GuidS, RTS, RandS, and CG), we use one-tailed paired sample t-test to compare the objective function values of each pair of methods, and Distance measure.The null hypothesis is that the average objective function and Distance measure of all methods are the same.If we show the null hypothesis as GT S = GuidS = RandS = RT S = CG = 0.The alternative hypothesis can be described as GT S < GuidS < RandS < RT S < CG for the objective value parameters and GT S > GuidS > RandS > RT S > CG for Distance measure.Ten paired sample t-test are performed for objective function values and Distance measure.These statistics are presented in Table 1.The distribution   6 shows the dose-volume and dose-wash of plans generated by GTS and CG for one test-case scenario.

IV. Discussion
In this research, we propose an efficient beam orientation optimization framework capable of finding a improved solution over CG, in a similar amount time, by utilizing a reinforcement learning structure involving a supervised learning network, DNN, to guide Monte Carlo tree search to explore the beam orientation selection decision space.
Although CG is a powerful optimization tool, it is a greedy approach that not only is computationally expensive and time consuming, but it also may get stuck in a local optimum.This is particularly true for highly non-linear optimization problems with lots of local optima, such as BOO.In this work, we tried 4 different approaches: 1) Guided Tree Search (GTS), 2) Guided Search (GuidS), 3) Random Tree Seach (RTS), and 4) Random Search (RandS).It is shown that although the quality of solutions using RandS, RTS and CG were not significantly different, in 50% of test-cases both RandS and RTS, which have no knowledge of the problem at the beginning of the search, can find solutions better than CG.This shows the high potential of improving the solution found by CG.
We saw that GTS and GuidS, both performs better than other methods, which is expected because both of these methods are using a prior knowledge (trained DNN) to explore the solution space.GTS even outperforms GuidS on average, since GTS is a combination of GuidS and RTS, means adding a search method to GuidS can improve the quality of the solution.But considering the insignificant difference in the performance of RTS and RandS, adding any search method to GuidS may results in better solutions and it may not be directly related to RTS.This issue will be studied in future researches.The poor performance of RTS may also suggest that using uniform tree search is too slow to converge to the optimal selection of beams.
Although GTS performs better than others to find solutions with better objective function values, but the dose spillage metrics, and specifically the average dose received by bladder in GTS plans can be improved further.Considering the success of GTS in reducing the objective function and its potential for further improvement, we will continue exploring new methods and techniques to upgrade the quality of treatment planning with the help of artificial intelligence.
We should note that CG is a greedy and deterministic algorithm, therefor using CG on the same problem always results in the same solution.This is of completely different nature from our search methods and it may not be fair to compare its performance with the four search procedures, which, given infinite time and resources, can act as brute forced approach and guarantee finding the optimal solution.However, our main goal is to find the best possible solution for BOO problem, and in this work we try to see which search algorithms can find a better solutions and faster.Hence we expect to see search algorithms outperform greedy algorithms.The results showed us that the objective function value of GTS and GuidS CG, RTS and RandS perform similarly.Even though, plans generated by DNN solutions may not be superior to CG, but it can mimic the CG algorithm very efficiently 41 and is a very successful tool to explore the search space as shown by GuidS and GTS, especially when compared to CG , which can easily exhaust the computational resources and is very slow to find one solution for a problems.The good performance of CG compared to RandS and RTS shows how powerful the CG method can be to find a solution, and the success of using DNN to explore the decision space represents the proper knowledge that can be achieved by learning from CG.
Finally, GTS is a problem specific search method that needs to be applied on each test-cases separately.To use the knowledge that we can get from the GTS performance, more advanced reinforcement algorithms can be trained to create a single general knowledge-based method that is not only very powerful to find the best possible solutions, but also very fast for doing so.The advanced reinforcement learning method then can be easily applied on more sophisticated and challenging problems such as Proton and 4π radiation therapy.For future studies, we are working to develop a smart, fast and powerful tool to be applied in these problems.

V. Conclusion
In this study, we proposed a method combined of two main components.First, a supervised deep neural network (DNN) to learn column generation (CG) decision-making process, and predict the fitness values of all candidate beams, beam with maximum fitness value will be chosen to add to the solution.CG, although powerful, is a heuristic, greedy algorithm that cannot guarantee the global optimality of the final solution, and leaves room for improvement.A Monte Carlo guided tree search (GTS) is proposed to see if finding a solution with better objective function in reasonable amount of time is feasible.After the DNN is trained, it is used to generate the beams fitness values for nodes in the decision tree, where each node represents a set of selected beams.Fitness values in each node are normalized and used as probability mass function to help deciding decision tree extension.Later probability distribution of beam selection will be modified by reward function, which is based on the solution of FMO problem FMO for every five selected beams.GTS continues to explore the decision tree for 1000 seconds.Along with GTS, three other approaches are also tested, GuidS which is also using DNN to select beams iteratively, but it does not update the probability distribution of beams during the search process.RTS which is a simple tree search algorithm, starts by randomly sampled from a uniform distribution of beam orientations for each node and continues to update beams probability distribution based on the tree search approach presented for GTS.And finally RandS which is randomly select beams, the most trivial and simple approach.

VI. Appendix
Although statistically with 130 number of test cases we can assume that approximately our metrics follow normal distribution, but based on the graphs of Figure 7a, this assumption may not be practical.Because of this, we introduce Distance measures to normalize our metrics.The probability distribution and cumulative probability mass function of Distance measure are presented in Figure 7b.By this graph we can verify that the this measure approximately normally distributed with similar standard deviation.

Figure 1 :
Figure 1: Schematic of the Supervised Training Structure to predict Beam Orientation fitness values.Column Generation (CG) as teacher and deep neural network (DNN) as Trainee.

Figure 2 :
Figure 2: An example of guided tree search, subscript are order that a node is generated, and superscript is the depth of the node in the tree.
the cost value of node b 1 1 is the average cost of nodes b 2 2 and b 6 2 .probability distribution (Y P ): an array of size 180 (the number of potential beam orientations), where i th element of this array represents the chance of improvement in the current cost value if tree branches out by selecting i th beams.After a node is discovered in the tree, this distribution is populated by using DNN.After the first discovery of a node, Y P is updated based on the reward values.Reward (Y R ): is a function of the node's cost values and the best cost value ever discovered in the search process.The reward values would be updated after each cost calculation and are calculated and updated by the reward calculation procedure defined in line 28 of Algorithm 1. Depth (Y D ): is simply the number of beam orientations selection after node Y is discovered.Name (Y id ): a unique string value as id for each node, this value is the path from root to node Y .Beam Set (Y B ): the set of beams selected for a branch started from root and ended in node Y .Parent (Y parent ): the immediate node before node Y in a branch from root to Y , except for root node, all other nodes in a tree have one parent.Children (Y children ) : the immediate node(s) of the sub-branches from the node Y , except for leaves, all other nodes in a tree have at-least one children.II.B.Monte Carlo Tree Search Algorithm 1 Select N beam orientations from M candidate beams 1: procedure Initialization 2:

7 : 8 : 9 :
predict an array of fitness values P = F (DNN, B) Select next node (b) with the probability of P (b) Update B: B = B ∪ {b} and NB = NB + 1 solve F M O given set B and save as V ← F mo(B) 14: if V < V * then 15: V * ← V and B * ← B 16: end if 17: end while 23: return B * and V * II.D. Data We used images from 70 patients with prostate cancer, each with 6 contours: PTV, body, bladder, rectum, left femoral head, and right femoral head.Additionally, the skin and ring tuning structures were added during the fluence map optimization process to control high dose spillage and conformity in the body.The patients were divided randomly into two exclusive sets: 1) a model development set, which includes training and validation data, consisting of 57 patients, 50 for training and 7 for validation, for cross-validation method, and 2) a test data set consisting of 13 patients.Each patients data contains 6 contours: PTV, body, bladder, rectum, and left and right femoral heads.Column generation was implemented with a GPU-based Chambolle-Pock algorithm 13 , a first-order primal-dual proximal-class algorithm, to create 6270 training and validation scenarios (22 5-beam plans for each of 57 patients) and 130 test scenarios (10 5-beam plans for each of 13 test patients).

Figure 4 :
Figure 4: The distance measure of the average of the best objective function value found by each method compared to CG solution.The value written inside the box is the median and the value at the top of each box-plot is the mean pm standard deviation of the Distance measure.
The best time to beat CG solution.The average time to beat CG.

Figure 5 :
Figure5: The computational time comparison between GTS, GuidS, RandS and RTS.The first time that each method beats a column generation solution.For this measure only successful scenarios are considered.Mean, standard deviation(at the top of each box as mean pm standard deviation) and median (in the middle of the box) are calculated for each method.
The distribution of objective values using GTS, GuidS, RTS, RandS, and CG methods.
Distance distribution of methods GTS, GuidS, RTS, and RandS.

Figure 7 :
Figure 7: The distribution of objective values and distance to CG objective values 15

Table 1 :
One-tailed paired sample t-test to test the average objective function value and Distance measure for every pairs of CG, GTS, GuidS, RTS, and RandS methods, with %99 confidence intervals.All values in red have p-values greater than 0.01.)

Table 2 :
19an ± standard deviation for PTV Statistics, Paddick Conformity Index (CI P addick ), and High Dose Spillage (R 50 ) of methods: CG, GTS, GuidS, RTS, and RandS.PTV statistics, Paddick Conformity Index(CI P addick ) and dose spillage(R 50 ) of plans generated by CG, GTS, GuidS, RTS, and RandS are presented in Table2.Note that PTV D 2 is used to measure PTV D max , as recommended by the ICRU Report 8319.The plans generated by all methods have very similar PTV coverage.CG plans have the highest CI P addick followed by GTS and GuidS plans.While CG and GuidS plans have the lowest dose spillage value followed by GTS.The average and maximum dose received by each structure are provided in Table3, these values reflect the fractional dose in plans generated by each method with the assumption that the prescription dose is one-e.g. if the prescription dose is 70 Gy, the average dose of 0.207 in the table means 14.47Gy (0.207 × 70) in the prescribed plan.The minimum values in each row are of Objective value and Distance measures are provided in appendix section VI..As highlighted by red, all pairs of the three methods of CG, RTS, and RandS have p-values greater than 0.01, and are significantly different, while the average Distance and objective value measures of GuidS and GTS are significantly different.Based on these results GTS outperforms all other methods significantly while in the second position is GuidS as was expected.

Table 3 :
The average and maximum fractional dose received by each structure in plans generated by GTS, GuidS, RTS, RandS, and, CG methods, where prescription dose is set to 1.shown as bold numbers for easier interpretation of the table.On average plans generated by GTS have lower mean dose to OARs compared to other methods, while plans generated by CG have the lowest maximum dose to OARs.GTS plans spare rectum and right femoral head better than other methods.Although the average fractional dose to bladder by by GTS plans (0.207) is more than