Accountable survival contrast-learning for optimal dynamic treatment regimes

A dynamic treatment regime (DTR) is an emerging paradigm in recent medical studies that seeks a sequence of decision rules assigning the optimal treatment to each patient by taking into account individual features such as genetic, environmental, and social factors. Although there is a large and growing literature on statistical methods for estimating optimal treatment regimes, most methodologies focus on complete data. In this article, we propose an accountable contrast-learning algorithm for the optimal dynamic treatment regime with survival endpoints. Our estimating procedure originates from a doubly robust weighted classification scheme, a model-based contrast-learning method that directly characterizes the interaction terms between predictors and treatments without main effects. To account for censoring, we adopt the pseudo-value approach, which replaces survival quantities with pseudo-observations for the time-to-event outcome. Unlike many existing approaches, which are mostly based on complicated outcome regression modeling or inverse-probability weighting schemes, the pseudo-value approach greatly simplifies the estimation of the optimal treatment regime by allowing investigators to conveniently apply standard machine learning techniques to censored survival data without losing much efficiency. We further explore SCAD penalization to identify informative clinical variables, and modified algorithms to handle multiple treatment options by searching upper and lower bounds of the objective function. We demonstrate the utility of our proposal via extensive simulations and an application to AIDS data.

This implies that the proposed method will yield the minimal value of the cause-1 CIF, which is achieved by the true optimal treatment regime.

Additional simulation results
We provide the results of numerical studies for single-stage treatment allocation under two different censoring assumptions. First, we consider the completely independent censoring setting, i.e., T ⊥⊥ C.
The censoring time C_i is generated from an exponential distribution, C_i ∼ Exp(c_0), where the nonnegative constant c_0 is fixed to yield approximately a 15% or 30% censoring rate. We fix the time point of interest at t_0 = 3 and conduct 1,000 Monte Carlo replications. For each replication we generate 500 samples with 15 covariates X_i = (x_{1i}, ..., x_{15i})^T, which are independent and identically distributed as Uniform[−2, 2]. The treatment indicator A_i ∈ {0, 1} is generated from a Bernoulli distribution with probability expit(·) under specifications (i) and (ii), where expit(a) = 1/(1 + exp(−a)). Since we posit a logistic model with x_{2i} and x_{3i}, case (i) corresponds to a correctly specified propensity model (True logistic), whereas case (ii) corresponds to a misspecified model (False logistic).
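The design above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code: the propensity coefficients below are hypothetical placeholders, since the exact logistic specifications of cases (i) and (ii) are not reproduced here, and c_0 is an arbitrary value rather than one calibrated to a 15% or 30% censoring rate.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 15      # sample size and number of covariates, as in the text
t0 = 3.0            # time point of interest

def expit(a):
    return 1.0 / (1.0 + np.exp(-a))

# Covariates: i.i.d. Uniform[-2, 2]
X = rng.uniform(-2.0, 2.0, size=(n, p))

# Treatment from a logistic (Bernoulli) propensity in x2 and x3;
# the coefficients 0.5 and -0.5 are hypothetical placeholders
prop = expit(0.5 * X[:, 1] - 0.5 * X[:, 2])
A = rng.binomial(1, prop)

# Independent censoring C ~ Exp(c0); in the study c0 is tuned to the
# target censoring rate, whereas 0.05 here is an arbitrary placeholder
c0 = 0.05
C = rng.exponential(scale=1.0 / c0, size=n)
```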
For the 3-year survival event, the survival time is generated from T_i = exp{1.5 + 0.5x_{1i} + A_i(x_{2i} − x_{3i}) + ε_i}, where the error ε_i follows an extreme-value distribution with rate 1 (i.e., exp(ε_i) ∼ Exp(1)). The true optimal treatment regime is g_0^opt = I(x_2 ≥ x_3), which naturally maximizes the survival time T_i, and the corresponding true maximal 3-year survival probability is S(3, g_0^opt) = 0.63. For the 3-year CIF in the competing risks event, let ψ_{1i}(X) and ψ_{2i}(X) denote the true Q-functions of the cause-1 and cause-2 events, respectively. We generate the cause-1 event indicator from a Bernoulli distribution with probability P(D_i = 1) = 1 − (1 − q)^{ψ_{1i}(X)}, where q ∈ (0, 1] is a fixed constant controlling the proportion of cause-1 events; with q = 0.8, this produces roughly 54% cause-1 events under 15% censoring and 48% under 30% censoring. With probability P(D_i = 1), the cause-1 failure time is generated from F_1(t|X_i, A_i) = 1 − {1 − q(1 − e^{−t})}^{ψ_{1i}(X)}, whereas, with probability 1 − P(D_i = 1), the cause-2 failure time is generated from F_2(t|X_i, A_i) = 1 − exp{−t ψ_{2i}(X)}. The true optimal treatment regime minimizing the cause-1 CIF is g_0^opt = I(x_2 ≤ x_3), and the corresponding minimal cause-1 CIF is F_1(3, g_0^opt) = 0.36. Tables S1 and S2 report the results for the 3-year survival probability and the cause-1 CIF, respectively. As in the main manuscript, we consider two naive approaches (g = 0 and g = 1) and four models: outcome weighted learning (OWL [?]) and its doubly robust version (DWL), penalized OWL (POWL), and the proposed penalized doubly robust weighted learning (PDWL). Each model is compared using four criteria: the true values {S(3, ĝ^opt), F_1(3, ĝ^opt)} and their empirical counterparts {Ŝ(3, ĝ^opt), F̂_1(3, ĝ^opt)}, together with the correct decision rate on 50,000 new test observations. For each scenario, the best model is highlighted in bold, and empirical standard errors are presented in parentheses.
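Given the stated CIFs, competing-risks data can be generated by first drawing the cause indicator and then inverse-transform sampling the failure time. The sketch below assumes ψ_1 and ψ_2 are supplied as scalars standing in for the true Q-functions (whose exact forms are not reproduced here); P(D = 1) = 1 − (1 − q)^{ψ_1} is obtained by letting t → ∞ in F_1.

```python
import numpy as np

rng = np.random.default_rng(1)
q = 0.8  # constant controlling the cause-1 proportion, as in the text

def sample_competing(psi1, psi2):
    """Draw a (time, cause) pair consistent with
    F1(t) = 1 - {1 - q(1 - e^{-t})}^{psi1}  for cause-1 events and
    F2(t) = 1 - exp(-t * psi2)              given a cause-2 event.
    psi1, psi2 stand in for the (unspecified) true Q-functions."""
    p1 = 1.0 - (1.0 - q) ** psi1           # P(D = 1) = F1 at t = infinity
    if rng.uniform() < p1:                 # cause-1 event: invert F1 on (0, p1)
        u = rng.uniform() * p1
        t = -np.log(1.0 - (1.0 - (1.0 - u) ** (1.0 / psi1)) / q)
        return t, 1
    t = rng.exponential(scale=1.0 / psi2)  # cause-2 event: Exp(psi2)
    return t, 2
```

With psi1 = 1, P(D = 1) reduces to q = 0.8, which gives a quick Monte Carlo sanity check on the cause-1 proportion.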
In general, DWL and PDWL show satisfactory performance, and PDWL remains more stable as the censoring rate increases. PDWL also attains the smallest standard errors in most settings.
Next, we consider the covariate-dependent censoring situation. The censoring time is generated from C_i ∼ Exp{exp(−c_0 − x_1 − x_2)}, where c_0 is a constant yielding the target censoring proportion. All other settings are the same as before. Given the true optimal treatment regime g_0^opt = I(x_2 ≤ x_3), the maximal 3-year survival probability is S(3, g_0^opt) = 0.65. The simulation results are summarized in Table S3. Here, we additionally consider "PDWL2", in which the pseudo-observations are adjusted by inverse-probability-of-censoring weights using the R package eventglm [?]. The proposed methods remain approximately unbiased even under covariate-dependent censoring, and there are no noticeable differences between the covariate-adjusted and covariate-unadjusted doubly robust approaches.
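The unadjusted pseudo-observations used throughout are leave-one-out jackknife values of the Kaplan-Meier estimator at t_0. Below is a minimal NumPy sketch of that computation; the actual implementation (e.g., via eventglm) may differ in tie handling and efficiency.

```python
import numpy as np

def km_at(time, delta, t0):
    """Kaplan-Meier estimate of S(t0); delta = 1 indicates an event.
    Observations are processed one at a time, which matches the usual
    estimator when event times are distinct."""
    order = np.argsort(time)
    s, at_risk = 1.0, len(time)
    for t, d in zip(time[order], delta[order]):
        if t > t0:
            break
        if d == 1:
            s *= 1.0 - 1.0 / at_risk
        at_risk -= 1
    return s

def pseudo_values(time, delta, t0):
    """Jackknife pseudo-observations: n * S_hat - (n - 1) * S_hat^(-i)."""
    n = len(time)
    full = km_at(time, delta, t0)
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        out[i] = n * full - (n - 1) * km_at(time[keep], delta[keep], t0)
    return out
```

Without censoring, the pseudo-value for subject i reduces exactly to the indicator I(T_i > t_0), which provides a convenient correctness check.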