Systematic comparison of the statistical operating characteristics of various Phase I oncology designs

Dose finding Phase I oncology designs can be broadly categorized as rule based, such as the 3 + 3 and the accelerated titration designs, or model based, such as the CRM and Eff-Tox designs. This paper systematically reviews and compares through simulations several statistical operating characteristics, including the accuracy of maximum tolerated dose (MTD) selection, the percentage of patients assigned to the MTD, over-dosing, under-dosing, and the trial dose-limiting toxicity (DLT) rate, of eleven rule-based and model-based Phase I oncology designs that target or pre-specify a DLT rate of ∼0.2, for three sets of true DLT probabilities. These DLT probabilities are generated at common dosages from specific linear, logistic, and log-logistic dose-toxicity curves. We find that all the designs examined select the MTD much more accurately when there is a clear separation between the true DLT rate at the MTD and the rates at the dose level immediately above and below it, such as for the DLT rates generated using the chosen logistic dose-toxicity curve; the separations in these true DLT rates depend, in turn, not only on the functional form of the dose-toxicity curve but also on the investigated dose levels and the parameter set-up. The model based mTPI, TEQR, BOIN, CRM and EWOC designs perform well and assign the greatest percentages of patients to the MTD, and also have a reasonably high probability of picking the true MTD across the three dose-toxicity curves examined. Among the rule-based designs studied, the 5 + 5 a design picks the MTD as accurately as the model based designs for the true DLT rates generated using the chosen log-logistic and linear dose-toxicity curves, but requires enrolling a higher number of patients than the other designs. We also find that it is critical to pick a design that is aligned with the true DLT rate of interest. Further, we note that Phase I trials are very small in general and hence may not provide accurate estimates of the MTD. 
Thus, our work provides a map for planning Phase I oncology trials or developing new designs.


Introduction
Phase I trials of a new anti-cancer drug are usually single arm, open label studies conducted on a small number (10s) of cancer patients, many of whom do not respond any longer to the standard treatment. Due to the toxic nature of many anti-cancer drugs as well as due to ethical reasons, cancer patients are enrolled in Phase I oncology trials, as opposed to the healthy volunteers used in Phase I trials in other therapeutic areas.
The main aim of a Phase I oncology trial is to investigate and understand the toxic properties (safety) of the new anti-cancer drug; efficacy is not traditionally the focus, although it is often observed and monitored by the oncologist. With regard to safety, the trial helps investigators determine the right dose and dosing interval as well as the best route of administration of the new drug. To determine the right dose, an endpoint such as dose-limiting toxicities (DLTs) in the first cycle is often considered.
For each dose finding Phase I trial, a set of pre-defined adverse events, typically only those possibly related to taking the study drug, constitutes the DLTs for that trial. Patients are traditionally monitored for DLTs during the first cycle of administration of the new anti-cancer drug; however, more recent trials may monitor DLTs for a longer period and may include toxicities in the DLT definition that are not included in the conventional definition of DLTs [1]. The starting dose in these dose finding trials is usually a very conservative dose based on animal studies of the drug, and the subsequent increasing doses to be administered are pre-specified. The number of patients with DLTs in each dose level is used to determine the Maximum Tolerated Dose (MTD). For a single anticancer drug being tested, the MTD is usually the highest dose level at which the observed DLT rate is equal to or below a specified percent. Phase II patients are generally dosed at the MTD determined in the corresponding Phase I trial. The above method for MTD selection is more applicable to cytotoxic agents where the toxicity and efficacy are assumed to increase monotonically with dose than to some modern molecularly targeted therapies where the MTD may not be reached even at higher doses due to their low toxicity; in such cases, another appropriate dosing endpoint may need to be considered such as the dose at which the key pharmacokinetic and pharmacodynamics parameters are optimal [1,2].
Dose finding Phase I oncology designs can be broadly categorized [3–6] as rule based (such as the 3 + 3 design) or model based (such as the CRM [7] and Eff-Tox designs [8]). The 3 + 3 design has been the workhorse dose finding design for Phase I oncology trials for a long time. It is still commonly used due to its simplicity and ease of implementation. However, depending on the target DLT rate of interest, it can be slow and inaccurate in estimating the MTD and can lead to a large portion of patients receiving sub-therapeutic doses that do not produce any clinically meaningful response [9]. Hence, other designs, including model-based designs, have been explored in recent years [10–12].
The establishment of the MTD for various Phase I oncology designs is the main focus of this paper. In this work, we explore extensions of the 3 + 3 design as well as the model based mTPI [13], TEQR [14], BOIN [15], CRM [7,3] and EWOC [16,17] designs and compare their performance. There is no unique criterion to evaluate these designs since the performance of each design depends on the true DLT probability at each dose and the target DLT rate of the design. Hence, we systematically compare several statistical operating characteristics for the true DLT rates generated at the same doses by three different dose-toxicity curves. In addition, we explore the effect of starting the trial at different dose levels below the true MTD on the accuracy of MTD selection in these designs. The 3 + 3 design and its extensions we consider target a DLT rate of 0.2, and we specify a target DLT rate of 0.2 for the model based designs we consider. Although the results in this paper focus on a target DLT rate of 0.2, we explain in the discussion section the implications of targeting other DLT rates such as 0.1 and 0.33 with the A + B designs considered and discuss other A + B designs that target these rates. We also study the performance of the model based designs considered when the target DLT rate specified is 0.1 and 0.33. In contrast to previous works that compare a limited number of specific designs [18], our comprehensive comparison across several designs should serve as a practical aid in applying these Phase I oncology designs or in developing new ones.

Rule based designs
We consider the 3 + 3 design, which targets a DLT rate of ~0.2 [19], as well as its various extensions that target a DLT rate of ~0.2. We also include the simple accelerated titration design and the 3 + 3 + 3 design in our study (Table 1) [20–22]. We then investigate several of their statistical operating characteristics, such as the accuracy of MTD selection among others. The formal definition of the MTD is that it is the dose for which Probability(DLT | dose = d) = target probability.
For the A + B designs [23] that allow only escalation, the algorithm that we follow is [21]: We then estimate the MTD to be the dose level immediately below the last dose level examined. For the standard 3 + 3 design (Table 1), which is a special case of the general A + B design, this implies that the MTD is estimated to be the highest dose in which fewer than 33% of patients experience a DLT.
For the A + B designs that also allow de-escalation, the algorithm that we follow is: For the 3 + 3 design with de-escalation, the MTD is estimated to be the highest dose in which fewer than 33% of patients experience a DLT, and in which at least six participants have been treated with the study drug.
For the rule-based designs where no de-escalation is allowed, Table 1 describes the dose finding rules; the specific x, y, and z for each A + B design can be determined based on the description of these designs in Table 1. To provide a preliminary idea of the properties of these designs, we depict in Fig. 1 the probability of not escalating for a single step for various true DLT rates for the escalation only designs considered. For example, for the 3 + 3 design that allows only escalation, we can escalate at each step or dose level if 1) 0 out of 3 patients experience a DLT or if 2) 1 out of 6 patients experiences a DLT; the probability of escalating at each step or dose level is $q^3 + 3pq^5$ and the probability of not escalating at each step is $3p^2q + p^3 + 9p^2q^4 + 9p^3q^3 + 3p^4q^2$, where $p$ is the probability of experiencing a DLT at the current dose level and $q = 1 - p$. Using these two probabilities and extending the framework to any number of steps, we can then calculate analytically the probability of selecting any dose level as the MTD for the 3 + 3 as well as other A + B designs that allow only escalation (see Lin, 2001 [24] and Appendix Table 1). This reference [24] also provides analytic formulae for the probability of MTD selection for the 3 + 3 and other A + B designs that allow de-escalation as well.
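The single-step probabilities above can be chained across dose levels to obtain the selection probabilities analytically. The following is a minimal Python sketch of that calculation for the escalation-only 3 + 3 design, not the authors' SAS code. The function names are hypothetical, and the convention that the highest dose is reported as the MTD when the trial escalates past it is an assumption made only for illustration.

```python
def escalation_probability(p):
    """P(escalate) at one dose level of the escalation-only 3 + 3 design."""
    q = 1.0 - p
    return q**3 + 3 * p * q**5


def non_escalation_probability(p):
    """P(not escalating) at one dose level, written out term by term."""
    q = 1.0 - p
    return (3 * p**2 * q + p**3
            + 9 * p**2 * q**4 + 9 * p**3 * q**3 + 3 * p**4 * q**2)


def mtd_selection_probabilities(dlt_rates):
    """Return a list whose entry i is P(dose level i is selected as the MTD).

    Entry 0 means that no dose is selected (escalation stops at dose level 1).
    The last entry is the probability of escalating past the highest dose,
    which this sketch ASSUMES is reported as "MTD = highest dose".
    """
    probs = []
    reach = 1.0  # probability of reaching the current dose level
    for p in dlt_rates:
        esc = escalation_probability(p)
        probs.append(reach * (1.0 - esc))  # stop here: MTD is one level below
        reach *= esc
    probs.append(reach)
    return probs


if __name__ == "__main__":
    # True DLT rates at dose levels 1 to 4 of the logistic curve, as quoted
    # in the text (0.01, 0.04, 0.20, 0.71); higher levels are left out here.
    for level, prob in enumerate(mtd_selection_probabilities([0.01, 0.04, 0.20, 0.71])):
        print(f"MTD = dose level {level}: {prob:.3f}")
```

Because the two single-step expressions sum to one for every p, the selection probabilities returned by the sketch always sum to one as well.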

Model based designs or designs that allow specification of the target DLT rate
Table 1 (fragment; rule for the simple accelerated titration design): Successively assign a single patient at each dose level until the patient has a DLT. Then switch to the 3 + 3 design (i.e. add 2 more patients to the dose level at which a DLT is first seen and then follow the rules of the 3 + 3 design).
Note to Table 1: The table provides the rules for the escalation only designs, but we also allow de-escalation in the 3 + 3, 2 + 4, 4 + 4 a, and 5 + 5 a designs and follow the algorithm described in the methods section. The designs that also allow de-escalation will target a slightly lower DLT rate than their counterparts that allow only escalation. One method to estimate the approximate target DLT rate of each design that also allows de-escalation is to run simulations for each design using several different dose-toxicity curves and then perform the following calculation: for each scenario, compute the sum over doses of the product of the true DLT rate at each dose and the simulated probability that that dose is selected as the MTD, and then average this value across the various scenarios (dose-toxicity curves). Based on our results for the logistic, log-logistic and linear dose-toxicity curves in Tables 3–5, we find that the approximate target DLT rate of the 3 + 3 design with de-escalation is 0.17, of the 2 + 4 design with de-escalation is 0.18, of the 4 + 4 a design with de-escalation is 0.21 (which is why we also included the 4 + 4 a design, even though its target DLT rate for the escalation only case is a little higher than 0.2), and of the 5 + 5 a design with de-escalation is 0.17. The 3 + 3 + 3 design targets an approximate DLT rate of 0.21.
In terms of model-based designs, we consider the Modified Toxicity Probability Interval (mTPI), Toxicity Equivalence Range (TEQR), Bayesian Optimal Interval Design (BOIN), Continual Reassessment Method (CRM) and Escalation with Overdose Control (EWOC) designs and explore their statistical operating characteristics. For these designs, we can choose the DLT rate that each design will target; we specify a target DLT rate of 0.2 for all of them, in order to compare their performance with that of the 3 + 3 design and its extensions that target a DLT rate of ~0.2. Note that although the TEQR design is not a model based design, it allows the specification of the target DLT rate.
The mTPI design, described in detail by Ji and others [13], is a Bayesian dose finding design that uses the posterior probability to guide dose selection. Its decision rules are based on a statistic called the unit probability mass (UPM), defined as the ratio of the probability mass of an interval to the length of that interval [13]. The toxicity probability scale is divided into three portions: $(0, p_T - \varepsilon_1)$ corresponding to under-dosing, $[p_T - \varepsilon_1, p_T + \varepsilon_2]$ corresponding to proper dosing and $(p_T + \varepsilon_2, 1)$ corresponding to over-dosing. Here $p_T$ is the target probability of dose limiting toxicity, and $\varepsilon_1$ and $\varepsilon_2$ are used to define the interval for the target DLT rate. The rules for escalating, staying at the same dose or de-escalating depend on which of these portions has the highest UPM for that dose level, based on a beta-binomial distribution with a beta(1,1) prior [13,14]. For example, the next cohort of patients will be treated at the next higher dose level if the UPM is the largest for the under-dosing interval. The trial stops if dose level 1 is too toxic or if the maximum sample size is reached or exceeded.
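The UPM rule can be illustrated with a short Python sketch that uses only the standard library. It is not the R code of Ji et al. [13]: the function names are hypothetical, the half-widths ε1 = ε2 = 0.05 are placeholder values chosen for illustration, and the safety rule that excludes overly toxic doses is left out. The beta CDF is evaluated through the standard binomial-sum identity, which holds for integer shape parameters.

```python
from math import comb


def beta_cdf_int(x, a, b):
    """CDF of Beta(a, b) at x for positive integers a and b.

    Uses the identity I_x(a, b) = P(Binomial(a + b - 1, x) >= a).
    """
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(a, n + 1))


def mtpi_decision(n_dlt, n_treated, p_t=0.2, eps1=0.05, eps2=0.05):
    """Return 'escalate', 'stay' or 'de-escalate' for the current dose level.

    eps1 and eps2 are ASSUMED illustrative values, not the paper's settings.
    """
    # Posterior of the DLT rate under a beta(1,1) prior.
    a, b = 1 + n_dlt, 1 + n_treated - n_dlt
    lo, hi = p_t - eps1, p_t + eps2
    cdf_lo, cdf_hi = beta_cdf_int(lo, a, b), beta_cdf_int(hi, a, b)
    # Unit probability mass = posterior mass of an interval / interval length.
    upm = {
        "escalate": cdf_lo / lo,                  # under-dosing interval
        "stay": (cdf_hi - cdf_lo) / (hi - lo),    # proper-dosing interval
        "de-escalate": (1 - cdf_hi) / (1 - hi),   # over-dosing interval
    }
    return max(upm, key=upm.get)


if __name__ == "__main__":
    for dlts in range(4):
        print(f"{dlts}/3 DLTs ->", mtpi_decision(dlts, 3))
```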
The TEQR design is a frequentist version of the mTPI design and is described in detail in the reference by Blanchard and Longmate [14]. This design is not based on the posterior probability but on the empirical DLT rate. The unit interval is divided into three portions: $(0, p_T - \varepsilon_1)$, $[p_T - \varepsilon_1, p_T + \varepsilon_2]$ and $(p_T + \varepsilon_2, 1)$. The rules for escalating, staying at the same dose or de-escalating depend on which of these portions contains the empirical DLT rate for that dose level: if the empirical DLT rate lies between 0 and $p_T - \varepsilon_1$, we escalate; if it lies in the interval $[p_T - \varepsilon_1, p_T + \varepsilon_2]$, we stay at the same dose; if it lies above $p_T + \varepsilon_2$, we de-escalate. In both the mTPI and TEQR designs, we stay at the current dose if the current dose is safe but the next higher dose is too toxic based on the data. A trial using the TEQR design stops if dose level 1 is too toxic or when a dose level achieves the selected MTD sample size. In a trial using the TEQR or the mTPI design, the MTD is determined to be the highest dose level with a DLT rate that is closest to (and below) the target DLT rate after applying isotonic regression at the end of the trial.
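As a rough illustration of the two ingredients just described, the Python sketch below implements the interval rule on the empirical DLT rate and a pool-adjacent-violators routine for the isotonic regression step. It does not reproduce the R package TEQR. The function names are hypothetical, and the half-widths ε1 = ε2 = 0.05 are assumed values used only for the example.

```python
def teqr_decision(n_dlt, n_treated, p_t=0.2, eps1=0.05, eps2=0.05):
    """Interval rule applied to the empirical DLT rate at the current dose."""
    rate = n_dlt / n_treated
    if rate < p_t - eps1:
        return "escalate"
    if rate <= p_t + eps2:
        return "stay"
    return "de-escalate"


def pava(rates, weights):
    """Weighted isotonic (non-decreasing) regression by pooling violators."""
    blocks = []  # each block: [pooled rate, total weight, number of doses]
    for rate, weight in zip(rates, weights):
        blocks.append([rate, weight, 1])
        # Merge backwards while the monotone order is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            r2, w2, c2 = blocks.pop()
            r1, w1, c1 = blocks.pop()
            blocks.append([(r1 * w1 + r2 * w2) / (w1 + w2), w1 + w2, c1 + c2])
    fitted = []
    for rate, _, count in blocks:
        fitted.extend([rate] * count)
    return fitted


if __name__ == "__main__":
    print(teqr_decision(0, 3), teqr_decision(1, 6), teqr_decision(2, 6))
    # Observed rates 0/3, 1/3 and 1/6 at three doses, weighted by patients.
    print(pava([0 / 3, 1 / 3, 1 / 6], [3, 3, 6]))
```

In the second call, the rates at the two higher doses violate monotonicity and are pooled into a common estimate weighted by the number of patients at each dose.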
The concept of the BOIN design is similar to that of the TEQR design in terms of dividing the toxicity probability scale into three intervals and using these intervals along with the empirical DLT rate to guide dose finding [15]. In contrast to the TEQR and mTPI designs, where the interval for the target DLT rate is fixed and is independent of the dose level and the number of patients that have been treated at that dose level, the BOIN design is more general and permits this interval to vary with the dose level and the number of patients that have been treated at that dose level. In this design, the probability of patients being assigned to very toxic doses or to subtherapeutic doses is low. A trial using the BOIN design usually stops at the pre-planned sample size but the design allows the incorporation of early stopping rules.
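For orientation, the sketch below computes the escalation and de-escalation boundaries of the commonly used fixed-boundary ("local") form of the BOIN design. This is an assumption on our part: the formulas and the defaults φ1 = 0.6φ and φ2 = 1.4φ come from the BOIN literature rather than from the text above, and the more general version in which the interval changes with dose level and sample size is not shown. The function name is hypothetical and the code does not call the R package BOIN.

```python
from math import log


def boin_boundaries(phi, phi1=None, phi2=None):
    """Return (lambda_e, lambda_d) for a target DLT rate phi.

    phi1 is the highest rate regarded as under-dosing and phi2 the lowest
    rate regarded as over-dosing; the defaults below are ASSUMED.
    """
    phi1 = 0.6 * phi if phi1 is None else phi1
    phi2 = 1.4 * phi if phi2 is None else phi2
    lam_e = log((1 - phi1) / (1 - phi)) / log(phi * (1 - phi1) / (phi1 * (1 - phi)))
    lam_d = log((1 - phi) / (1 - phi2)) / log(phi2 * (1 - phi) / (phi * (1 - phi2)))
    return lam_e, lam_d


def boin_decision(n_dlt, n_treated, phi=0.2):
    """Compare the empirical DLT rate at the current dose with the boundaries."""
    lam_e, lam_d = boin_boundaries(phi)
    rate = n_dlt / n_treated
    if rate <= lam_e:
        return "escalate"
    if rate >= lam_d:
        return "de-escalate"
    return "stay"


if __name__ == "__main__":
    print(boin_boundaries(0.2))
    print(boin_decision(0, 3), boin_decision(1, 6), boin_decision(1, 3))
```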
The CRM design and its variations are well-known and are described in several references [25–28]. This design uses the DLT information obtained from all the previous patients to determine the dose level to which the next patient (or cohort of patients [28]) is assigned. The first patient may be given a dose whose DLT rate is expected to be close to the target DLT rate based on information from previous studies, although caution usually dictates starting at a lower dose level. The dose given to each subsequent patient is decided by the DLT data of all the previous patients in conjunction with a dose-toxicity model, for example a one-parameter logistic model with parameter "a". The estimates of "a" in the dose-toxicity model are updated using Bayesian methods after the DLT information from each patient is obtained. For example, after n patients are enrolled, Bayes' theorem gives $f(a \mid U_n) = L_{U_n}(a)\, g(a) / \int L_{U_n}(u)\, g(u)\, du$, where $f(a \mid U_n)$ is the posterior density of a, $g(a)$ is the prior distribution for a, $L_{U_n}(a)$ is the likelihood function, and $U_n$ are the DLT data after n patients [29]. The dose-toxicity model is then used to recommend the dose level for the next patient, typically the dose with a DLT rate closest to but less than the updated DLT estimate from the model, subject to not skipping over untested doses. The stopping point for this process is usually the pre-determined sample size of the trial or an observation of no change in dose assignment for a sequence of n patients.
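A minimal Python sketch of one Bayesian update is given below. It is only an illustration under stated assumptions and not the CRM simulator used in this work: it assumes a one-parameter power model of the form p_i^exp(a) with a normal prior of mean 0 and variance 2, approximates the posterior on a grid, uses a hypothetical prior skeleton, and recommends the dose whose posterior mean DLT rate is closest to the target without skipping untested doses (one of several CRM variants).

```python
import math


def crm_posterior_dlt(skeleton, n_treated, n_dlt, prior_var=2.0,
                      grid_size=2001, half_width=10.0):
    """Posterior mean DLT probability at each dose under p_i ** exp(a)."""
    step = 2 * half_width / (grid_size - 1)
    grid = [-half_width + k * step for k in range(grid_size)]
    weights = []
    for a in grid:
        log_w = -a * a / (2 * prior_var)  # log prior density up to a constant
        for p0, n, x in zip(skeleton, n_treated, n_dlt):
            p = min(max(p0 ** math.exp(a), 1e-12), 1 - 1e-12)
            log_w += x * math.log(p) + (n - x) * math.log(1 - p)  # log-likelihood
        weights.append(math.exp(log_w))
    total = sum(weights)
    return [sum(w * p0 ** math.exp(a) for w, a in zip(weights, grid)) / total
            for p0 in skeleton]


def recommend_dose(posterior_dlt, target=0.2, highest_tried=0):
    """0-based index of the next dose; at most one level above highest_tried."""
    last_allowed = min(highest_tried + 1, len(posterior_dlt) - 1)
    return min(range(last_allowed + 1),
               key=lambda i: abs(posterior_dlt[i] - target))


if __name__ == "__main__":
    skeleton = [0.05, 0.10, 0.20, 0.30, 0.45]  # hypothetical prior DLT guesses
    post = crm_posterior_dlt(skeleton, [3, 3, 0, 0, 0], [0, 1, 0, 0, 0])
    print([round(p, 3) for p in post], recommend_dose(post, 0.2, highest_tried=1))
```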
The EWOC design is a Bayesian adaptive dose finding design, whose unique feature is over-dose control, i.e. the posterior probability of treating patients at doses above the MTD, given the data, cannot be greater than a certain pre-specified probability $\alpha$ [16,17]. In mathematical terms, we specify a prior distribution for $(\rho_0, \gamma)$, where $\rho_0$ is the probability of DLT at the minimum dose and $\gamma$ is the MTD, and let $\Pi_n(\gamma)$ be the marginal posterior cdf of $\gamma$ given $D_n$ (DLT data after n patients). The first patient receives the dose $x_1$, and conditional on the event of no DLT at $x_1$, the $(n+1)$th patient receives the dose $x_{n+1} = \Pi_n^{-1}(\alpha)$, which implies that the posterior probability of exceeding the MTD is equal to $\alpha$ [17]. The design also minimizes the under-dosing of patients. This means that the MTD is generally reached rapidly, and after the initial cohorts of patients, the remaining cohorts of patients are treated at dose levels reasonably close to the MTD. In this design, it is also possible to add a stopping rule for excessive toxicity, for example the trial will be stopped early if three consecutive DLTs are observed or if the posterior probability at the minimum dose exceeds a certain pre-defined value.
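The dose assignment step amounts to taking a quantile of the marginal posterior of the MTD. The short Python sketch below shows this step only, assuming that the posterior has already been evaluated on a grid of candidate MTD values; the function name, the grid and the value α = 0.25 are hypothetical choices for illustration and are not taken from the web based program used in this work.

```python
def ewoc_next_dose(gamma_grid, posterior_weights, alpha=0.25):
    """Smallest grid value whose posterior CDF reaches alpha.

    gamma_grid holds increasing candidate MTD values and posterior_weights
    the (possibly unnormalized) posterior mass at each value.
    """
    total = sum(posterior_weights)
    cumulative = 0.0
    for gamma, weight in zip(gamma_grid, posterior_weights):
        cumulative += weight / total
        if cumulative >= alpha:
            return gamma
    return gamma_grid[-1]


if __name__ == "__main__":
    grid = [100, 200, 300, 400]
    print(ewoc_next_dose(grid, [1, 1, 1, 1], alpha=0.25))
    print(ewoc_next_dose(grid, [1, 1, 1, 1], alpha=0.5))
```

On a discrete grid, the posterior probability that the returned dose lies strictly above the MTD stays below α, which is the over-dose control property described above.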

Simulations of rule based designs
For our simulations in SAS of the 3 + 3 design and its extensions, we use a Bernoulli random generator, along with the probability of a DLT at different doses generated by a dose-toxicity curve, to randomly assign each patient a DLT or not depending on the probability of a DLT at the assigned dose. We then implement the assignment rules of each design and follow each simulated trial to its conclusion. For example, for the designs that allow only escalation, we escalate until the number of DLTs at the last dose level examined exceeds that allowed by the specific design, and the MTD is then estimated to be one dose level below the last dose level examined. We perform these simulations 10,000 times for each combination of design and dose-toxicity curve. The increase in dose at a new dose level beyond dose level 1 for each dose-toxicity curve investigated is based on the modified Fibonacci series (2, 1.67, 1.5, 1.4, 1.33, 1.33, 1.33 etc.), as commonly used in many oncology trials [25].
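The simulation procedure can be sketched in a few lines of Python for the escalation-only 3 + 3 design. This is an illustrative re-implementation and not the authors' SAS program: the function names and the seed are hypothetical, and reporting the highest dose as the MTD when the trial escalates past it is an assumption.

```python
import random


def modified_fibonacci_doses(start=100.0, n_levels=8):
    """Doses obtained from the multipliers 2, 1.67, 1.5, 1.4, 1.33, 1.33, ..."""
    multipliers = [2, 1.67, 1.5, 1.4]
    doses = [start]
    for i in range(n_levels - 1):
        factor = multipliers[i] if i < len(multipliers) else 1.33
        doses.append(doses[-1] * factor)
    return doses


def simulate_3plus3(dlt_rates, rng):
    """Simulate one escalation-only 3 + 3 trial.

    Returns the 1-based MTD level, or 0 if escalation stops at dose level 1.
    """
    for level, p in enumerate(dlt_rates, start=1):
        dlts = sum(rng.random() < p for _ in range(3))       # first cohort of 3
        if dlts == 1:
            dlts += sum(rng.random() < p for _ in range(3))  # expand to 6
        if dlts >= 2:
            return level - 1  # MTD is one level below the last level examined
    return len(dlt_rates)     # ASSUMPTION: escalated past the top dose


if __name__ == "__main__":
    rng = random.Random(2017)            # arbitrary seed
    rates = [0.01, 0.04, 0.20, 0.71]     # logistic-curve rates quoted in the text
    n_sim = 10_000
    counts = [0] * (len(rates) + 1)
    for _ in range(n_sim):
        counts[simulate_3plus3(rates, rng)] += 1
    print([round(d) for d in modified_fibonacci_doses()])
    print([c / n_sim for c in counts])
```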
A logistic dose-toxicity curve is often used to describe the underlying relation between dose and toxicity in cytotoxic agents [22]. Hence, we specify the true DLT probability at each dose based on a specific logistic curve. In addition to the logistic curve, we consider a specific log-logistic and a linear dose-toxicity curve to study the performance sensitivity of these designs to the true DLT probabilities generated by these different dose-toxicity curves. Table 2 shows the true DLT rates at each dose level for each of the three dose-toxicity curves. For determining the two unknown coefficients of each dose-toxicity curve, we use the DLT rates at two different doses, namely we assume a true DLT rate of 0.01 at dose level 1 of 100 units and a DLT rate of 0.2 at the true MTD (dose level 3) of 334 units. We assume a DLT rate of 0.2 at the MTD because the 3 + 3 design targets a DLT rate between 0.2 and 0.25 [19]. Hence this choice of 0.2 allows a fair comparison of the simulation results from the 3 + 3 design with those from other A + B designs whose approximate target DLT rate is 0.2 (various A + B designs target DLT rates other than 0.2; see Table 4.1 of Chapter 4 in the reference by Ting [30]). However, we also study the performance of these designs for different target DLT rates, such as 0.1 and 0.33.
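Since each curve has two unknown coefficients, the two anchoring points (DLT rate 0.01 at 100 units and 0.2 at 334 units) determine it completely. The Python sketch below solves for the coefficients and evaluates the curves at dose levels 1 to 4; the helper names are hypothetical, and the doses of 200 and 501 units for levels 2 and 4 are derived here from the modified Fibonacci multipliers rather than read from Table 2.

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def expit(x):
    return 1 / (1 + math.exp(-x))


def identity(x):
    return x


def fit_two_points(dose_scale, rate_scale, d1, p1, d2, p2):
    """Intercept and slope of rate_scale(p) = a + b * dose_scale(d) through two points."""
    slope = (rate_scale(p2) - rate_scale(p1)) / (dose_scale(d2) - dose_scale(d1))
    intercept = rate_scale(p1) - slope * dose_scale(d1)
    return intercept, slope


# Anchors from the text: DLT rate 0.01 at 100 units and 0.2 at 334 units.
ANCHORS = (100, 0.01, 334, 0.20)
LOGISTIC = fit_two_points(identity, logit, *ANCHORS)      # logit(p) = a + b * dose
LOG_LOGISTIC = fit_two_points(math.log, logit, *ANCHORS)  # logit(p) = a + b * ln(dose)
LINEAR = fit_two_points(identity, identity, *ANCHORS)     # p = a + b * dose


def dlt_rate(curve, dose):
    """True DLT rate at a dose for curve in {'logistic', 'log-logistic', 'linear'}."""
    if curve == "logistic":
        a, b = LOGISTIC
        return expit(a + b * dose)
    if curve == "log-logistic":
        a, b = LOG_LOGISTIC
        return expit(a + b * math.log(dose))
    a, b = LINEAR
    return a + b * dose


if __name__ == "__main__":
    print("log-logistic coefficients:", LOG_LOGISTIC)
    for curve in ("logistic", "log-logistic", "linear"):
        print(curve, [round(dlt_rate(curve, d), 2) for d in (100, 200, 334, 501)])
```

The log-logistic coefficients obtained this way agree with the equation given in the caption of Table 4.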
We choose the following broad range of statistical operating characteristics to compare and evaluate the dose finding schemes considered for these three dose-toxicity curves: the accuracy of MTD selection, the average number of dose levels examined and its standard deviation, the maximum and median number of dose levels examined, the mean and median number of patients and the median number of DLTs per trial, the mean number of patients dosed at the MTD, the mean percentage of patients dosed at the MTD, above the MTD and below the MTD, the average number of patients and DLTs at each dose level, the average trial DLT rate and the average DLT rate at the MTD. Further, we investigate the effect of the location of the starting dose relative to the true MTD on the accuracy of MTD selection for the chosen logistic and log-logistic dose-toxicity curves, for example when we start our trial simulation at dose level −3, −2 or −1 instead of at dose level 1 (see Table 2; these low doses double each time). In addition, we use three linear dose-toxicity curves with different offsets to investigate the effect of the location of the starting dose relative to the true MTD on the accuracy of MTD selection for the 3 + 3 design. Our SAS programs, available on request, are presently able to provide results for six designs (3 + 3, 2 + 4, 4 + 4 a, 5 + 5 a, 3 + 3 + 3, and simple accelerated titration designs) and three dose-toxicity curves (linear, logistic, log-logistic). However, the programs are simple and flexible and can be extended to other A + B designs as well as any other dose-toxicity curve.

Simulations of model based designs or designs that allow specification of the target DLT rate
We use R code provided by Ji et al. [13] to implement the mTPI design. The program requires the following inputs: number of simulations, target probability of dose limiting toxicity $p_T$, the values $\varepsilon_1$ and $\varepsilon_2$ that define the lower and upper bound, respectively, of the interval for the target DLT rate, sample size, cohort size, starting dose and the true DLT rate at each dose.
We use the R package TEQR to implement the TEQR design. The program requires the following inputs: number of simulations, target probability of dose limiting toxicity $p_T$, the values $\varepsilon_1$ and $\varepsilon_2$ that define the lower and upper bound, respectively, of the interval for the target DLT rate, DLT probability deemed to be too toxic, desired sample size at the MTD, cohort size, maximum number of cohorts, starting dose and the true DLT rate at each dose.
We use the R package BOIN to implement the BOIN design. The program requires the following inputs: number of simulations, target probability of dose limiting toxicity $p_T$, cohort size, number of cohorts, starting dose, cut-off to eliminate an overly toxic dose for safety and the true DLT rate at each dose. Although the design allows the possibility of rules for stopping prior to reaching the planned sample size, we did not implement these early stopping rules, to permit fair comparisons between designs.
We use a CRM trial simulator to implement the various scenarios for the CRM design. The program requires the following inputs: number of simulations, maximum sample size, cohort size, number of doses, starting dose, target probability of dose limiting toxicity, stopping probability (the trial is stopped if the probability that the lowest dose is more toxic than the target is greater than this value) and the true DLT rates at the various doses. The probability of DLT at dose i is modeled as $p_i^{\exp(a)}$, where $p_i$ is a constant and a is distributed a priori as a normal random variable with mean 0 and variance 2. The initial default prior probabilities of DLT used in the software are given in Appendix Table 3. The trial stops when the planned sample size is reached or if the lowest dose is too toxic.
We use a web based program to implement the EWOC design. The program requires the following inputs: number of simulations, target probability of dose limiting toxicity, maximum acceptable probability of exceeding the target dose ($\alpha$), variable $\alpha$ increment, cohort size, sample size, minimum dose, maximum dose, number of dose levels and the true probability of DLT at each dose. Although the EWOC design allows the possibility of rules for stopping prior to reaching the planned sample size, the current implementation of the EWOC design does not include early stopping rules.
The parameters used for the mTPI, TEQR, BOIN, CRM and EWOC designs are shown in Appendix Tables 2, 3, 4 and 5. Note that the sample size is an output of the rule-based A + B designs as well as the TEQR design. For the mTPI, BOIN, CRM and EWOC designs, we use the same sample size that the TEQR design yields for each of the three sets of true DLT rates. For all the simulation results in this section, dose level 1 is the lowest dose (see Table 2) and dose level 3 is the true MTD.
For the logistic dose-toxicity curve constructed, there is a very clear separation between the true DLT rate at the MTD and the rates at the dose levels below and above it: the DLT rate of 0.2 at the MTD versus 0.04 at the dose level below and 0.71 at the dose level above (Table 2). The DLT rate of 0.2 at dose level 3 aligns with the range of toxicity rates that the escalation-only A + B designs target (Table 1) and is the target DLT rate specified for the model-based designs. Hence all the designs pick dose level 3 as the MTD the largest percentage of times in our simulations, while incorrectly picking the other dose levels substantially less frequently (Table 3; also see Appendix Table 1 for exact analytic results for MTD selection for the 3 + 3 design and its extensions). The 4 + 4 a design with and without de-escalation, the mTPI design, the CRM design and the 3 + 3 + 3 design correctly pick dose level 3 as the MTD ~79%, ~80%, ~76%, ~76% and ~76% of the time, respectively (Table 3 and Fig. 2). The median number of patients enrolled in the trial ranges from 6 for the simple accelerated titration design to 25 for the 5 + 5 a design. As expected, with the 3 + 3 design, about half of the patients are given doses below the MTD. The BOIN design and the 5 + 5 a design with and without de-escalation also treat a large percentage of patients at doses below the MTD: about 50%, 48% and 49%, respectively. On the other hand, the simple accelerated titration design over-doses a large percentage of patients (~43%). The model based designs generally treat a large percentage of patients at the MTD. The average trial DLT rate ranges from 0.17 for the TEQR design to 0.4 for the simple accelerated titration design; the median number of DLTs per trial ranges from 2 for the 2 + 4 design without de-escalation to 5 for the 4 + 4 a design with de-escalation and the 5 + 5 a design, among the extensions of the 3 + 3 design considered.
For the log-logistic dose-toxicity curve constructed, there is a clear separation between the true DLT rate at the MTD and the rates at the dose levels below and above it: the DLT rate of 0.2 at the MTD versus 0.06 at the dose level below and 0.42 at the dose level above (Table 2). Although this separation is not as large as it is in the logistic dose-toxicity curve considered, all the designs still pick dose level 3 as the MTD more frequently than they pick any other dose level. The CRM, mTPI, BOIN and 5 + 5 a with and without de-escalation designs correctly pick dose level 3 as the MTD ~74%, ~63%, ~59%, ~58% and ~58% of the time, respectively (Table 4). The median number of patients enrolled in the trial ranges from 7 for the simple accelerated titration design to 30 for the 5 + 5 a design with de-escalation. For this dose-toxicity curve, about 49% of patients are given doses below the MTD in the 3 + 3 design. The BOIN, TEQR and 5 + 5 a design with and without de-escalation also treat a large percentage of patients at doses below the MTD: about 50%, 47%, 47% and 47%, respectively. On the other hand, the simple accelerated titration design over-doses a large percentage of patients (~47%). The model based designs generally treat a large percentage of patients at the MTD. The average trial DLT rate ranges from 0.17 for the TEQR design to 0.34 for the simple accelerated titration design; the median number of DLTs per trial ranges from 2 for the simple accelerated titration design, reflecting the very small sample size for this design, to 5 for the 4 + 4 a design and the 5 + 5 a design with de-escalation, among the extensions of the 3 + 3 design considered.
For the linear dose-toxicity curve constructed, the DLT rate at dose level 3 is 0.2 and the DLT rate at dose level 4 is 0.34 (Table 2). Although this separation is even smaller than that in the logistic and log-logistic dose-toxicity curves considered, all the designs except the accelerated titration design (which picks dose level 3 as the MTD 27% of the time versus dose level 4 as the MTD 29% of the time) pick dose level 3 as the MTD more frequently than any other dose level. The CRM, mTPI, 5 + 5 a with and without de-escalation and TEQR designs correctly pick dose level 3 as the MTD, but only ~54%, ~45%, ~45%, ~45% and ~45% of the time, respectively (Table 5). The median number of patients enrolled in the trial ranges from 8 for the simple accelerated titration design to 30 for the 5 + 5 a design with de-escalation. For this dose-toxicity curve, about half of the patients are given doses below the MTD in the 3 + 3 design. The BOIN, TEQR, CRM, mTPI designs and the 5 + 5 a design with and without de-escalation also treat a large percentage of patients at doses below the MTD: about 58%, 50%, 50%, 48%, 48% and 48%, respectively. On the other hand, the simple accelerated titration design over-doses a large percentage of patients (~49%). The model based designs generally treat a large percentage of patients at the MTD. The average trial DLT rate ranges from 0.16 for the TEQR design to 0.31 for the simple accelerated titration design; the median number of DLTs per trial ranges from 2 for the simple accelerated titration design to 5 for the 4 + 4 a and 5 + 5 a designs, among the extensions of the 3 + 3 design.
Table note: The bold highlighting shows the designs predicted by simulations to pick the MTD most accurately, to enroll the largest and smallest number of patients, to dose the maximum percentage of patients at the MTD, to under-dose the maximum percentage of patients, and to over-dose the maximum percentage of patients. Note also that the sum of columns 2 to 4 may be less than 100% because, in the remaining small percentage of trials, no dose level is selected as the MTD.
a The numbers shown in brackets are for a corresponding design that also allows dose de-escalation.

R. Ananthakrishnan et al. / Contemporary Clinical Trials Communications 5 (2017) 34–48
Results for the accuracy of MTD selection for the 3 + 3 design for all the three dose-toxicity curves considered are presented in Fig. 3; results for some of the other designs are presented graphically in Appendix Figs. 1–3.

Effect of starting the trial at lower dose levels on the accuracy of MTD selection
In the previous section, our simulations are started at dose level 1 for all the rule-based designs, and dose level 3 is the true MTD for all the designs. This means that it takes only two escalations from the starting dose to reach the true MTD in the escalation only designs. However, the accuracy of MTD selection could depend on where the starting dose is located relative to the true MTD, for example if it is located six dose levels below the true MTD versus two, because some dose finding designs may be slow to escalate while others may be fast to do so. Thus, we investigate the effect of starting at lower dose levels on the accuracy of MTD selection in the 3 + 3 design and its extensions that allow only escalation, using the logistic dose-toxicity curve in Table 2. We find that the number of patients on the trial and the percentage of patients who are under-dosed, both of which are outputs of the program for the rule-based designs, increase when we start at the lower doses, but the accuracy of MTD selection is largely unaffected for all these designs (Table 6). We find similar results for the model based designs. We also find similar results for the log-logistic dose-toxicity curve in Table 2 to those described for the logistic dose-toxicity curve. The result that the location of the starting dose relative to the true MTD does not affect the accuracy of MTD selection may not be surprising since the true DLT rates at dose levels −1, −2 and −3 are very small for the logistic and log-logistic dose-toxicity curves used.
In general, the accuracy of MTD selection will be affected when the true DLT rates at these lower dose levels are much greater than 0.01 (say 0.1). We have demonstrated this for the 3 + 3 design using three linear dose-toxicity curves with different offsets (see Appendix Table 8 and Appendix Fig. 4). In practice, the starting dose of the trial is usually an extremely conservative estimate based on animal studies, and the DLT rates at the first few dose levels are expected to be very low.¹ In this case, the accuracy of MTD selection should not be affected even when the true MTD is several doses above the starting dose in the rule-based escalation-only designs considered, and we can enroll patients at the same low starting dose for these designs.
Table 4. Simulation results: log-logistic dose-toxicity: $\log_e\left(\text{DLT rate}/(1-\text{DLT rate})\right) = -16.8485 + 2.66078\,\log_e(\text{dose})$.
The bold highlighting shows the designs predicted by simulations to pick the MTD most accurately, to enroll the largest and smallest number of patients, to dose the maximum percentage of patients at the MTD, to under-dose the maximum percentage of patients, and to over-dose the maximum percentage of patients. Note also that the sum of columns 2 to 4 may add up to <100% because, in the remaining small percentage of simulations, no dose level is selected as the MTD.
a The numbers shown in brackets are for a corresponding design that also allows dose de-escalation.
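As a worked illustration of the log-logistic curve quoted in the Table 4 caption, the following Python sketch evaluates the DLT rate at a given dose and inverts the curve to find the dose with a given DLT rate. The intercept and slope are taken from the caption; the function names and the example dose of 300 are hypothetical, and the dose units are whatever units Table 2 uses, which this sketch does not assume.

```python
import math

INTERCEPT = -16.8485  # from the Table 4 caption
SLOPE = 2.66078       # coefficient of log_e(dose), from the Table 4 caption


def dlt_rate(dose):
    """True DLT rate at a dose under the log-logistic curve."""
    logit = INTERCEPT + SLOPE * math.log(dose)
    return 1.0 / (1.0 + math.exp(-logit))


def dose_for_rate(rate):
    """Dose at which the curve reaches the given DLT rate (inverse of dlt_rate)."""
    logit = math.log(rate / (1.0 - rate))
    return math.exp((logit - INTERCEPT) / SLOPE)


target_dose = dose_for_rate(0.2)   # dose with a true DLT rate of 0.2
print(f"dose with DLT rate 0.2: {target_dose:.1f}")
print(f"DLT rate at a hypothetical dose of 300: {dlt_rate(300.0):.3f}")
```

Because the curve is linear on the logit scale in log(dose), the separation between DLT rates at adjacent dose levels depends on how the dose levels are spaced as well as on the slope.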

Discussion
In this work, we have systematically compared via simulations the statistical operating characteristics of various Phase I oncology designs, namely the 3 + 3 design and its extensions that target a DLT rate of ~0.2 as well as the mTPI, TEQR, BOIN, CRM and EWOC designs with a pre-specified target DLT rate of 0.2, for three sets of true DLT rates (generated for the same doses from a specific linear, logistic and log-logistic dose-toxicity curve). Although this is not an exhaustive comparison of all the current Phase I oncology designs, we have covered multiple commonly used ones. The 3 + 3 design is very simple and easy to implement and hence is still commonly used. However, our simulations show, not unexpectedly, that it under-doses a large percentage of patients, and is also not the design that picks the MTD most accurately for any of the dose-toxicity curves examined, with or without de-escalation.
All the designs examined select the MTD fairly accurately when there is a clear separation between the true DLT rate at the MTD and the rates at the dose level immediately below and above it, as is the case for the DLT rates generated using the chosen logistic dose-toxicity curve. However, when this separation is small, as is the case for the DLT rates generated using the chosen linear dose-toxicity curve, the accuracy of MTD selection is much lower. The separations in these true DLT rates depend, in turn, not only on the functional form of the dose-toxicity curve but also on the investigated dose levels and the parameter set-up. The considered A + B designs with de-escalation generally pick the MTD more accurately than the corresponding escalation-only design for the true DLT rates generated using the chosen log-logistic and linear toxicity curves, but not for the logistic one. Some of the other rule based designs examined pick the MTD more accurately than the 3 + 3 design, depending on the true DLT rate at each dose. For example, the 5 + 5 a design is as accurate as the model based designs in picking the MTD for the true DLT rates generated using the chosen log-logistic and linear dose-toxicity curves, but requires enrolling a larger number of patients compared to the other designs considered (~30 patients) and under-doses a large percentage of patients (~48%) for these dose-toxicity curves. Among the designs investigated, the simple accelerated titration design over-doses a large percentage of patients. Over-dosing of patients in oncology trials is an important issue that needs to be considered carefully in terms of study design since the toxicities at the higher doses can be very harmful to patients. The EWOC design explicitly takes this into consideration; in this design, one can control the expected proportion of patients receiving doses above the MTD by prespecifying the maximum acceptable probability of exceeding the target dose.
Although some model-based designs can be more difficult to implement than rule based designs, the model based designs studied, mTPI, TEQR, BOIN, CRM and EWOC designs, perform well and assign the maximum percentage of patients to the MTD, and also have a reasonably high probability (given the small sample size) of picking the true MTD.
In our simulations, we assumed a true DLT rate of 0.2 at the MTD (dose level 3) because it has been shown that the standard 3 + 3 design targets a toxicity rate between 0.2 and 0.25 [19]. However, when a DLT rate of 0.1 is specified as the target DLT rate, the various A + B designs considered would not, in general, select the MTD accurately because 0.1 is not within their target range, and when a DLT rate of 0.33 or 0.4 at the MTD is assumed, A + B designs that target a higher DLT rate would pick the MTD correctly more often than the 3 + 3 design. For example, for the linear dose-toxicity curve in Table 2, dose level 2 is the true MTD if the target DLT rate is 0.1. In this case and for the extensions of the 3 + 3 design considered, percentages for correct MTD identification for dose level 2 are lower than those for dose level 3 and range from 14% (accelerated titration design) to 29% (5 + 5 a with target range 0.2–0.25); the percentage for 3 + 3 is 27% (target range 0.17–0.26). If we consider a 5 + 5 design that targets a DLT range of 0.1–0.15 (see Table 4.1 of Chapter 4 of the reference by Ting [30]), it selects dose level 2 as the MTD ~43% of the time, which is much higher than the percentages with which the 3 + 3 and the other A + B designs with a target DLT rate of ~0.2 select dose level 2 as the MTD (results for this 5 + 5 design are not included in any table). Dose level 4 is the true MTD if the target DLT rate is 0.33. If we consider the 4 + 4 b design (target range 0.38–0.44) and 5 + 5 b design (target range 0.3–0.35) (see Table 4.1 of Chapter 4 of the reference by Ting [30]), they both select dose level 4 as the MTD ~40% of the time (results not shown here). This is much higher than the percentages with which the 3 + 3 and the other A + B designs with a target DLT rate of ~0.2 select dose level 4 as the MTD for the chosen linear dose-toxicity curve (percentages for correct MTD identification range from 20% to 31%).
Results for the accuracy of MTD selection for the model based designs for the linear dose-toxicity curve given in Table 2 and for the target DLT rates of 0.1 and 0.33 are provided in Appendix Tables 6 and 7 respectively. The accuracy of MTD selection decreases as the target DLT rate increases from 0.1 to 0.33 for the mTPI, TEQR, BOIN and CRM designs, but not for the EWOC design, for the chosen linear dose-toxicity curve. Our simulations for the A + B and model based designs show that for designs where the approximate DLT rate targeted by the design is known, it is critical to pick a design that is aligned with the true DLT rate of interest.
Fig. 3. Percentage of times that the 3 + 3 design selects each dose level as the MTD for the true DLT rates given in Table 2, generated from the three dose-toxicity curves. These percentages are from simulations and the results are shown in Tables 3–5. "3+3 logistic" denotes the 3 + 3 design with the DLT rates generated from the logistic dose-toxicity curve in Table 2.
We also showed that as long as the true DLT rates at the first few dose levels are very low, the accuracy of MTD selection is largely unaffected by the number of escalations it takes to reach the true MTD, for the rule-based escalation-only designs considered that target a DLT rate of ~0.2.
For the standard 3 + 3 design, our simulations, where the starting dose is two levels below the true MTD, show that the maximum number of dose levels examined varies between 5 for the logistic dose-toxicity curve and 7 for the linear and log-logistic dose-toxicity curves considered, while the median number of dose levels examined is 4 for all three dose-toxicity curves. In comparison, a literature review of 41 trials that were performed using the standard 3 + 3 design found that the median number of dose levels examined was 6 (range 2–12 dose levels), about 45% of the patients were under-dosed and about 20% of the patients were over-dosed [31]. These empirical results are consistent with our simulation findings that the 3 + 3 design under-doses about 50% of the patients and over-doses about 22% of the patients on the trial, for all three dose-toxicity curves. The average number of patients enrolled in trials that are based on the 3 + 3 design is, however, much higher in the literature review, with a mean of 44 patients, than in our simulations, where we found a mean of ~14 patients for all three dose-toxicity curves. However, this literature review is based on trials of targeted anti-cancer agents that reached the MTD, and we do not know the exact percentage of trials that included expansion cohorts or whether the initial cohorts started at very low doses; hence, the above comparisons are not exact. Nevertheless, it is clear from clinical trial data as well as our simulations that Phase I trials are very small and thus may not provide good estimates of the MTD. If we consider designs with a higher average sample size, say 50–60 patients, they will have a much higher accuracy of MTD selection.
In the future, it may be worthwhile investing in the enrollment of a larger number of patients even in a Phase I trial to obtain more accurate estimates of the right dose to be used for later Phase trials, although there is always a trade-off between costs (lower number of patients) and more accurate estimates (higher number of patients).

Conclusions
In conclusion, our comprehensive study compares and contrasts the 3 + 3 design with multiple other Phase I oncology designs with an approximate target DLT rate of 0.2 for various scenarios of true underlying DLT rates, in order to understand which designs pick the true MTD most accurately, which under-dose and over-dose the maximum percentage of patients, which assign the maximum number and percentage of patients to the MTD cohort, and which explore the maximum number of dose levels and enroll the largest number of patients in each case. Our SAS programs are flexible and can be extended to include other A + B designs, other dose-toxicity curves as well as other evaluation criteria. The summaries in this paper provide considerable information on design property tradeoffs, and the means to explore additional settings. These may be useful aids in choosing a Phase I design for a particular setting.
Same as the median sample size obtained from the TEQR design (the sample size is not a direct input of the program but the number of cohorts is an input and we input the number of cohorts such that the number of cohorts*cohort size is the desired sample size).

Number of cohorts: Desired sample size/cohort size
Cut off to eliminate an overly toxic dose for safety: 0.95
True DLT rate at each dose level: Values from Table 2 for each dose-toxicity curve
CRM Inputs:
Model: The probability of toxicity at dose i is modeled as p_i^exp(a), where p_i is a constant and a is distributed a priori as a normal random variable with mean 0 and variance 2
Prior probabilities of toxicity: the defaults in the program are used (dose level 1 = 0.15, dose level 2 = 0.25, dose level 3 = 0.3, dose level 4 = 0.45, dose level 5 = 0.51, dose level 6 = 0.56, dose level 7 = 0.6)
Stopping probability (the trial is stopped if the probability that the lowest dose is more toxic than the target is greater than this value): 0.9
The software can be found at: https://biostatistics.mdanderson.org/SoftwareDownload/SingleSoftware.aspx?Software_Id=13. After the first cohort, each successive cohort is given the dose whose posterior probability of toxicity given the data collected thus far is closest to the target, subject to one additional requirement: one cannot skip over an untried dose. If the method would otherwise skip over an untried dose, the lowest untried dose is given instead.
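The CRM dose-assignment rule described above can be sketched in Python as follows. This is an illustrative approximation, not the MD Anderson software: the posterior is computed on a simple evenly spaced grid over a, "posterior probability of toxicity" is taken to be the posterior mean of p_i^exp(a), and the function names, grid limits and grid size are assumptions of the sketch. The prior toxicity values, the prior variance of 2, the target of 0.2 and the stopping probability of 0.9 are the inputs listed above.

```python
import math

SKELETON = [0.15, 0.25, 0.30, 0.45, 0.51, 0.56, 0.60]  # prior toxicity probabilities p_i
TARGET = 0.2      # target DLT rate
PRIOR_VAR = 2.0   # prior variance of a (prior mean 0)
STOP_PROB = 0.9   # stopping probability


def posterior_grid(n_treated, n_dlt, lo=-6.0, hi=6.0, steps=2401):
    """Return grid values of a and their normalized posterior weights."""
    a_values = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    weights = []
    for a in a_values:
        w = math.exp(-a * a / (2.0 * PRIOR_VAR))  # unnormalized normal prior
        for p0, n, y in zip(SKELETON, n_treated, n_dlt):
            p = p0 ** math.exp(a)                 # power model p_i^exp(a)
            w *= p ** y * (1.0 - p) ** (n - y)    # binomial likelihood kernel
        weights.append(w)
    total = sum(weights)
    return a_values, [w / total for w in weights]


def posterior_mean_toxicity(n_treated, n_dlt):
    """Posterior mean toxicity probability at each dose level."""
    a_values, weights = posterior_grid(n_treated, n_dlt)
    return [sum(w * p0 ** math.exp(a) for a, w in zip(a_values, weights))
            for p0 in SKELETON]


def next_dose(n_treated, n_dlt):
    """Index of the dose for the next cohort, or None if the trial stops."""
    a_values, weights = posterior_grid(n_treated, n_dlt)
    # p_1^exp(a) decreases in a, so the lowest dose exceeds the target when a < cutoff.
    cutoff = math.log(math.log(TARGET) / math.log(SKELETON[0]))
    if sum(w for a, w in zip(a_values, weights) if a < cutoff) > STOP_PROB:
        return None
    post = posterior_mean_toxicity(n_treated, n_dlt)
    best = min(range(len(post)), key=lambda i: abs(post[i] - TARGET))
    untried = [i for i, n in enumerate(n_treated) if n == 0]
    if untried and best > untried[0]:
        best = untried[0]  # do not skip over an untried dose
    return best


# Example: a first cohort of three patients at dose level 1 with no DLTs.
print(next_dose([3, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0]))
```

The no-skipping constraint is what keeps an optimistic posterior from jumping several levels at once after an early run of patients without DLTs.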
Appendix
Table 2 for each dose-toxicity curve
Prior distribution: ρ_0 ~ Uniform(0, 0.2) (the prior for ρ_0, the probability of DLT at the minimum dose, is Uniform(0, 0.2)); γ ~ Uniform(100, 500) (the prior for the maximum tolerated dose γ is Uniform(100, 500))
The EWOC software is available at: https://biostatistics.csmc.edu/ewoc/ewocWeb.php.
−6 implies that the starting dose is 6 dose levels below the true MTD, and similarly for the others. We observe that for an offset of 0 (when the true DLT rate = 0 for the first 6 dose levels), the accuracy of MTD selection is not affected by how many dose levels below the true MTD the starting dose level is located, i.e. the percentage of times (out of 10000 simulations) that dose level 6 (true MTD) is selected as the MTD is constant (~30%) for the different starting dose locations relative to the true MTD. However, for an offset of 0.1 (when the true DLT rate = 0.1 for the first 6 dose levels), the accuracy of MTD selection is affected by how many dose levels below the true MTD the starting dose level is located.
"5+5 a logistic" implies the 5 + 5 a design with the true DLT rates given in Table 2, generated from the logistic dose-toxicity curve, and similarly for the others.