Generalisations of a Bayesian decision-theoretic randomisation procedure and the impact of delayed responses

The design of sequential experiments and, in particular, randomised controlled trials involves a trade-off between operational characteristics such as statistical power, estimation bias and patient benefit. The family of randomisation procedures referred to as Constrained Randomised Dynamic Programming (CRDP), which is set in the Bayesian decision-theoretic framework, can be used to balance these competing objectives. A generalisation and novel interpretation of CRDP is proposed to highlight its inherent flexibility to adapt to a variety of practicalities and align with individual trial objectives. CRDP, as with most response-adaptive randomisation procedures, hinges on the limiting assumption that patient responses are available before allocation of the next patient. This assumption forms one of the greatest barriers to the implementation of such procedures in practice and, despite being an important research question, its impact has not received a thorough treatment. Therefore, motivated by the existing gap between the theory of response-adaptive randomisation (which is abundant with proposed methods in the immediate response setting) and clinical practice (in which responses are typically delayed), the performance of CRDP in the presence of fixed and random delays is evaluated. Simulation results show that CRDP continues to offer patient benefit gains over alternative procedures and is relatively robust to delayed responses. To compensate for a fixed delay, a method which adjusts the time horizon used in the optimisation objective is proposed and its performance illustrated.


A.1. Backward recursion algorithm for generalised CRDP
The backward recursion algorithm from the theory of dynamic programming runs backward in the time step variable t, starting at the end of the time horizon, i.e. t = T, and decreasing to t = 0. At each time step t, it runs through all of the joint states z = (s_A, f_A, s_B, f_B) reachable at that time step, i.e. those satisfying s_A + f_A + s_B + f_B = t. Note that at time step t, we assume that t patients have been allocated using this procedure and t outcomes have been observed. However, in the case of delayed responses with fixed delay d, for example, we would only have observed t − d outcomes from patients allocated using this procedure, and another d outcomes from patients allocated during the initial phase with equal fixed randomisation.
If t = T, there is no reward to receive from allocating patients because no more patients will arrive. Thus, we only consider the penalties and, consequently, F_T(z) = Q(z) for all z whose components sum to T.
For t = T − 1 down to t = 0, the value-to-go functions F_t satisfy the Bellman equation, which allows them to be expressed recursively in terms of F_{t+1}. Suppose now that z is such that s_A + f_A + s_B + f_B = t. Denote by e_i the unit vector of length four with the i-th element equal to 1. We decompose the time step into three sub-steps: (1) pre-decision, i.e. before making the action choice, when the penalty-involving reward Q(z) is incurred; (2) post-decision, i.e. after making the action choice but before effectuating the randomisation; and (3) post-allocation, i.e. after effectuating the randomisation resulting in a patient allocation, during which a patient response is observed and the trial state is updated at the beginning of the next time step.
Proceeding backwards, we first define the post-allocation quantities of the value-to-go function. If treatment A is allocated to the next patient, then the value-to-go function under an optimal policy is

F_t^{(A)}(z) = (s̄_A / (s̄_A + f̄_A)) (1 + F_{t+1}(z + e_1)) + (f̄_A / (s̄_A + f̄_A)) F_{t+1}(z + e_2),

where s̄_j = s_{j,0} + s_j and f̄_j = f_{j,0} + f_j for treatment j ∈ {A, B} represent the prior information and observed data combined.
Alternatively, if treatment B is allocated to the next patient, then the value-to-go function under an optimal policy is

F_t^{(B)}(z) = (s̄_B / (s̄_B + f̄_B)) (1 + F_{t+1}(z + e_3)) + (f̄_B / (s̄_B + f̄_B)) F_{t+1}(z + e_4).

Second, we define the post-decision quantities of the value-to-go function. If action a = 1 is chosen, then the value-to-go function under an optimal policy is

F_t^{1}(z) = p F_t^{(A)}(z) + (1 − p) F_t^{(B)}(z),

and analogously when action a = 2, that is,

F_t^{2}(z) = (1 − p) F_t^{(A)}(z) + p F_t^{(B)}(z),

where p denotes the randomisation probability assigned to the treatment favoured by the chosen action (action 1 favouring A and action 2 favouring B). Finally, the pre-decision quantities of the value-to-go function are defined as

F_t(z) = Q(z) + max{F_t^{1}(z), F_t^{2}(z)}.

If F_t^{1}(z) > F_t^{2}(z), then it is optimal to employ action 1, and vice versa. If they are equal, then both actions are optimal choices, and one would equally randomise between them to avoid any systematic allocation bias.

Similar patterns of results are observed for the DP procedure as for the CRDP procedure, but an increased delay brings much higher benefits for statistical operating characteristics in the DP case. This is because the baseline statistical performance of DP is very poor due to the lack of randomisation and constraining, meaning a greater level of imbalance can occur (note that the scale of the bias and MSE plots for the DP is much larger than that used for the corresponding CRDP plots).
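The recursion above can be sketched in code. The following is a minimal Python sketch, not the implementation from the paper: it assumes Bernoulli outcomes with prior pseudo-counts (s_{j,0}, f_{j,0}), a randomisation probability p for the constrained actions, and a user-supplied penalty function Q; memoisation stands in for an explicit sweep over the states with s_A + f_A + s_B + f_B = t.

```python
from functools import lru_cache

def crdp_value(T, p=0.9, s0=(1, 1), f0=(1, 1), Q=lambda z: 0.0):
    """Value F_0(0, 0, 0, 0) of a CRDP-style backward recursion.

    T  : time horizon (number of patients)
    p  : randomisation probability of the favoured arm under each action
    s0 : prior success pseudo-counts (s_{A,0}, s_{B,0})
    f0 : prior failure pseudo-counts (f_{A,0}, f_{B,0})
    Q  : penalty incurred at the pre-decision sub-step (and at t = T)
    """
    @lru_cache(maxsize=None)
    def F(sA, fA, sB, fB):
        z = (sA, fA, sB, fB)
        if sum(z) == T:                  # terminal step: penalty only
            return Q(z)
        # posterior success probabilities, combining prior and data
        pA = (s0[0] + sA) / (s0[0] + sA + f0[0] + fA)
        pB = (s0[1] + sB) / (s0[1] + sB + f0[1] + fB)
        # (3) post-allocation values: expected success + value-to-go
        FA = pA * (1 + F(sA + 1, fA, sB, fB)) + (1 - pA) * F(sA, fA + 1, sB, fB)
        FB = pB * (1 + F(sA, fA, sB + 1, fB)) + (1 - pB) * F(sA, fA, sB, fB + 1)
        # (2) post-decision values: action 1 favours A, action 2 favours B
        F1 = p * FA + (1 - p) * FB
        F2 = (1 - p) * FA + p * FB
        # (1) pre-decision value: penalty plus best action
        return Q(z) + max(F1, F2)

    return F(0, 0, 0, 0)
```

With p = 1 and Q ≡ 0 the sketch degenerates to the unconstrained DP; for T = 1 and uniform priors, for example, the value equals the prior mean 0.5.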

A.2. Simulation results for DP with delayed responses
Consider the fixed delay case in Fig. A.8 with θ_B = 0.1. While the no-delay case has a power of around 0.17, a delay of d = 5 increases it to around 0.51 and a delay of d = 15 to 0.83. At the same time, the percentage of patients on the superior treatment decreases from 94% to 91% and 86%, respectively. A delay of around d = 22 introduces sufficient balancing effects (on average, at least 11 observations on each arm) to bring DP to perform akin to CRDP in the no-delay case (in which the degree of constraining penalises end-of-trial states with ≤ 11 observations). When the delay is 25 (one third of the trial size), there is a loss of approximately 15% in patient benefit relative to the value attained in the no-delay case. However, the percentage of patients on the superior treatment is still approximately 30% larger than with equal fixed randomisation. In terms of the power, a delay of 25 increases it to around 0.93 (almost 80% greater than when there is no delay), which is very close to the power obtained by equal fixed randomisation. Therefore, by introducing a delay in response, although the DP procedure is now adapting based on reduced information, it continues to allocate a considerably larger percentage of patients to the superior treatment than equal fixed randomisation.
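The delay mechanics described above can be illustrated with a small simulation. This is a hedged sketch, not the DP procedure itself: a posterior-mean greedy rule (uniform priors) stands in for DP purely to show how a fixed delay d means the rule adapts on only t − d observed responses when allocating patient t.

```python
import random

def simulate_delayed_trial(n=75, d=5, thetaA=0.5, thetaB=0.1, rng=None):
    """Simulate one trial with a fixed response delay d.

    The outcome of patient i becomes available only when patient i + d
    is about to be allocated, so at step t the rule has seen t - d
    responses. A posterior-mean greedy rule stands in for DP here.
    Returns the proportion of patients allocated to the superior arm A.
    """
    rng = rng or random.Random()
    s = [0, 0]; f = [0, 0]     # observed successes/failures per arm
    pending = []               # queue of (patient index, arm, outcome)
    arms = []
    for t in range(n):
        # reveal outcomes whose delay has elapsed
        while pending and pending[0][0] + d <= t:
            _, a, y = pending.pop(0)
            (s if y else f)[a] += 1
        # greedy allocation on posterior means, ties broken at random
        mA = (1 + s[0]) / (2 + s[0] + f[0])
        mB = (1 + s[1]) / (2 + s[1] + f[1])
        a = 0 if mA > mB else 1 if mB > mA else rng.randint(0, 1)
        y = rng.random() < (thetaA, thetaB)[a]
        pending.append((t, a, int(y)))
        arms.append(a)
    return arms.count(0) / n

rng = random.Random(2024)
props = [simulate_delayed_trial(d=5, rng=rng) for _ in range(200)]
print(sum(props) / len(props))   # average proportion on the superior arm
```

Even with the delay, the adaptive rule places well over half of the patients on the superior arm, consistent with the qualitative pattern reported for DP above.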

Fig. A.12 compares the performance of (CR)DP to the DRPWR in trials with a fixed delay and no treatment difference. The first plot in Fig. A.12 illustrates the changes in type I error rates for the (CR)DP and DRPWR as the delay increases. The type I error rate of (CR)DP appears to first increase and then decrease with d because there are two opposing forces involved: conservatism of Fisher's exact test (especially for small sample sizes) and increased error caused by the RAR. Recall that the desired significance level is 0.1. However, under equal randomisation, Fisher's exact test does not reach that level due to the conservatism of the test and the attained level is in fact 0.07 (represented by the red dashed line). As the delay length increases, (CR)DP behaves similarly to equal randomisation and, thus, the type I error rate approaches the attained significance level of 0.07 (which is why we observe a decrease). If the test were attaining the nominal level of 0.1, then we would observe inflation of the type I error due to the RAR.
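The conservatism of Fisher's exact test can be checked directly. The sketch below is illustrative, not the paper's code: it assumes n = 75 patients split 38/37 under equal randomisation and a common success rate of 0.5, computes two-sided exact p-values from the hypergeometric distribution, and estimates the attained level at the nominal α = 0.1 (the text above reports 0.07).

```python
import math, random

def fisher_p(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]:
    sum of hypergeometric probabilities no larger than the observed one."""
    n1, n2, k = a + b, c + d, a + c
    denom = math.comb(n1 + n2, k)
    pmf = lambda x: math.comb(n1, x) * math.comb(n2, k - x) / denom
    p_obs = pmf(a)
    lo, hi = max(0, k - n2), min(k, n1)
    return sum(pmf(x) for x in range(lo, hi + 1)
               if pmf(x) <= p_obs * (1 + 1e-9))

def attained_level(n1=38, n2=37, theta=0.5, alpha=0.1, sims=2000, seed=7):
    """Monte Carlo estimate of the attained significance level under
    equal randomisation with no treatment difference."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(sims):
        sA = sum(rng.random() < theta for _ in range(n1))
        sB = sum(rng.random() < theta for _ in range(n2))
        rejections += fisher_p(sA, n1 - sA, sB, n2 - sB) <= alpha
    return rejections / sims

print(attained_level())   # noticeably below the nominal 0.1
```

The discreteness of the exact test is what drives the gap between the nominal and attained levels at this sample size.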
The type I error rates for the DRPWR are consistently, albeit very slightly, smaller than those for (CR)DP (with delay) until around d = 60, after which the two perform similarly. Since the treatments have the same success rates, the percentage of patients allocated to either treatment behaves accordingly (close to 50%) and the bias values lie within (−0.001, 0.001) irrespective of the procedure or delay length. Fig. A.13 compares the performance of (CR)DP to the DRPWR in trials with a random delay and no treatment difference. The first plot illustrates the changes in type I error rates for (CR)DP and DRPWR as the expected delay increases. As in the fixed delay case, after an initial increase for (CR)DP, the type I error rate then decreases gradually. In contrast, the type I error for DRPWR remains relatively constant.

A.5. Comparison of fixed and random delays on (CR)DP
Here, we compare the performance measures of (CR)DP with a fixed delay versus (CR)DP with a random delay for a specific scenario in which θ_A = 0.5 and θ_B = 0.1. We have calibrated the random delays so that we expect them to be the same length, on average, as the fixed delays. We use this comparison purely for illustrative purposes to highlight the differences that can occur as a result of the delay being random rather than fixed. Fig. A.14 shows that, under random delays, the power is smaller, more patients are allocated to the superior treatment, and a larger bias is observed. It is interesting to note that for d = n, the percentage of patients on the superior treatment is 50% when the delay is fixed, as expected, but closer to 70% for CRDP and 79% for DP if it is random (see the middle plot in Fig. A.14). This is because there will be some patients with a small (or no) delay, by random chance, in which case the (CR)DP procedure still adapts relatively quickly and leads to a higher patient benefit (see Fig. A.15). The bias behaves similarly: it does not converge to 0 as quickly as in the fixed delay case. Moreover, when the (expected) delay is small (d = 0 and d = 5), we observe that the performance of CRDP is similar regardless of whether the delay is fixed or random. However, for larger (expected) delays, random delays affect the performance similarly to much shorter fixed delays, e.g. random d = 25 is akin to fixed d = 15, random d = 50 is akin to fixed d = 25, and random d = 100 seems akin to fixed d = 35.
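The intuition that mean-calibrated random delays behave like shorter fixed delays can be made concrete. The sketch below assumes exponentially distributed delays with mean d; this is an illustrative modelling choice, not necessarily the distribution used in the study. Under it, a clear majority of delays are shorter than the mean, which is exactly the "some patients with a small delay" effect described above.

```python
import random

def random_delays(d, size, seed=0):
    """Draw `size` exponential delays with mean d -- one way to calibrate
    random delays to match a fixed delay d on average (an illustrative
    assumption; any mean-d distribution would do for the calibration)."""
    rng = random.Random(seed)
    return [rng.expovariate(1 / d) for _ in range(size)]

delays = random_delays(d=25, size=100_000)
print(sum(delays) / len(delays))                   # sample mean, close to 25
print(sum(x < 25 for x in delays) / len(delays))   # fraction shorter than the mean
```

For the exponential, the fraction of delays below the mean is 1 − e^{−1} ≈ 63%, so most responses arrive sooner than the fixed-delay counterpart would allow.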
A.6. Adjusting the time horizon of DP

[Figure caption: Power/type I error, % of patients on the superior treatment, and the average bias and MSE of the treatment effect estimator for DP and DP-TH when n = 75, θ_A = 0.5 and θ_B ∈ {0.1, 0.9}, for different delay lengths d (estimated over 1,000,000 simulations).]

[Fig. A.17: Probability of allocating a patient to the superior treatment when θ_A = 0.5 and θ_B = 0.9 in a trial of size n = 75 (estimated over 1,000,000 simulations). The black and red lines correspond to DP with time horizons T = n and T = n − d, respectively.]
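The time-horizon adjustment amounts to running the same backward recursion with T = n − d in place of T = n, so that the procedure only "counts" the patients whose allocation can still be influenced by observed responses. A minimal sketch (unconstrained DP with Q ≡ 0 and uniform priors, on a small n for tractability; not the authors' code):

```python
from functools import lru_cache

def dp_value(T):
    """Expected total successes under the optimal unconstrained DP policy
    with two Bernoulli arms and uniform Beta(1, 1) priors."""
    @lru_cache(maxsize=None)
    def F(sA, fA, sB, fB):
        if sA + fA + sB + fB == T:
            return 0.0
        pA = (1 + sA) / (2 + sA + fA)   # posterior mean of arm A
        pB = (1 + sB) / (2 + sB + fB)   # posterior mean of arm B
        vA = pA * (1 + F(sA + 1, fA, sB, fB)) + (1 - pA) * F(sA, fA + 1, sB, fB)
        vB = pB * (1 + F(sA, fA, sB + 1, fB)) + (1 - pB) * F(sA, fA, sB, fB + 1)
        return max(vA, vB)
    return F(0, 0, 0, 0)

# DP-TH simply optimises over the shortened horizon n - d
n, d = 20, 5
print(dp_value(n), dp_value(n - d))
```

Shortening the horizon lowers the objective value (fewer future patients to earn successes from), which is the intended compensation: the rule stops "investing" in information it will never be able to use before the trial ends.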