Rejoinder of “Dynamic treatment regimes: Technical challenges and applications”

Main article 10.1214/14-EJS920. Eric B. Laber is in the Department of Statistics at North Carolina State University, 2311 Stinson Dr., Raleigh, NC, 27695 (E-mail laber@stat.ncsu.edu). He acknowledges support from NIH grant P01 CA142538 and PR Department of Natural and Environmental Resources (PRDNER) grant PR-W-F14AF00171. Daniel J. Lizotte. is in the Department of Computer Science at the University of Waterloo, Ontario, N2L G31. He acknowledges support from the Natural Sciences and Engineering Research Council of Canada. Min Qian is in the Department of Biostatistics at Columbia University, New York City, NY, 10032. She acknowledges support from NSF grant DMS-1307838. Susan A. Murphy is in the Departments of Statistics and Psychiatry and in the Institute for Social Research, all at the University of Michigan, Ann Arbor, MI, 48109. She acknowledges support from NIMH grant R01-MH-080015 and NIDA grant P50-DA-010075.


Contents
We thank the discussants for their thoughtful and provocative discussions; they have raised several fundamental philosophical questions, suggested new and interesting directions for research, and established new properties of our proposed confidence interval approach. We also thank the Editor for organizing these discussions and for providing helpful feedback on earlier versions of our manuscript. Because the discussants touched on similar themes, we have organized our responses into categories: (i) choosing an appropriate target for inference; (ii) new approaches to estimation and inference; and (iii) scalability and direct estimation.
1. Choosing an appropriate target for inference 1.1. Confidence intervals for c ⊺ β * 1 Consider two potential applications of confidence intervals for c ⊺ β * 1 : (U1) forming a confidence interval for Q 1 (h 1 , 1) − Q 1 (h 1 , −1) to determine if there is sufficient evidence to recommend one treatment over the other for a patient presenting at baseline with H 1 = h 1 ; and (U2) to construct confidence intervals for individual components of β * 1 to determine which patient characteristics are important for tailoring treatment. Robins and Rotnitzky (RR hereafter) argue that for (U1) β * 1 is not clinically meaningful as it assumes that the decision maker will select treatment using π dp 2 at the second stage, which is likely untrue. RR propose developing confidence intervals for c ⊺ β * 1 ( π 2 ) defined as follows. For any second stage decision rule, say π 2 , define so that β * 1 (π 2 ) denotes the optimal first stage coefficients assuming that a decision maker will assign treatments according to π 2 at the second stage. Thus, RR argue that if the estimated optimal regime π 2 is to be used to assign treatments at the second stage then the data-dependent parameter β * 1 ( π 2 ) may be more clinically meaningful than β * 1 ( = β * 1 (π dp 2 ) in their notation). In a pleasant surprise for us, RR showed that our proposed adaptive confidence interval c ⊺ β * 1 is also a valid confidence interval for c ⊺ β * 1 ( π 2 ) at least in the case of binary predictors. We look forward to generalizations of this result.
We maintain that if (U2) is of interest, that is, the confidence intervals are used to inform scientific theory and generate hypotheses for subsequent studies, then confidence sets for β * 1 are scientifically meaningful. In this context it may be of interest to identify which patient covariates are important for optimal treatment choice at each stage. One way to identify important covariates under the optimal regime is to look at confidence intervals for components of β * 2,1 and β * 1,1 . In our experience, estimation of optimal DTRs is often conducted as a secondary, exploratory analysis in which case confidence intervals for β * 1 are likely of interest.

An alternative value function
Goldberg, Song, Zeng, and Kosorok (GSZK hereafter) propose a new measure of the quality of a DTR. Traditionally the value of a DTR, say π, is defined as the expected outcome if all patients in the population of interest are assigned treatment according to π (see Schulte et al., 2013). Because the value of an estimated optimal DTR is nonregular when a non-null subgroup of patients have a small treatment effect in one or more of the treatment stages, GSZK propose a new estimand, called the truncated value of π, which is the expected outcome under π but restricted to the population of patients with clinically meaningful treatment effects at each stage. It is claimed (p. 6) that the truncated value can be made to be arbitrarily close to the value yet is regular and asymptotically normal under regularity conditions. We show that if one can obtain a regular estimand whose distance from the value can be controlled then one can obtain valid confidence intervals for value. We also demonstrate in a one-stage setting that the proposed truncated value may be arbitrary far from the value under certain generative models.
For an estimated DTR, say π n , let V ( π n ) = E πn Y be the value of π n . Suppose that there exists an alternative estimand V ǫ ( π n ) for which: (P1) there exists estimator V ǫ,n ( π n ) so that √ n( V ǫ,n ( π n ) − V ǫ ( π n )) is regular and asymptotically normal; and (P2) |V ǫ ( π n ) − V ( π n )| ≤ b(ǫ) + o P (1) for some (possibly random) function b(ǫ) for which there exists consistent estimator b n (ǫ) that satisfies P (|V ǫ ( π n ) − V ( π n )| ≤ b n (ǫ)) = 1 − o(1). Because V ǫ,n ( π n ) is regular and asymptotically normal for α ∈ (0, 1) we can apply standard methods to construct consistent estimators, say u n and ℓ n , of the (1 − α/2) × 100% and (α/2) × 100% percentiles of the sampling distribution of Thus, one potentially fruitful direction for inference for the value function is to develop smooth approximations to the value with a known or consistently estimable bound on the approximation error. However, the truncated value proposed by GSZK need not satisfy (P1). We demonstrate this using a one-stage decision problem. The observed data are where: X ∈ R p denotes pre-treatment patient information; A ∈ {−1, 1} denotes the treatment received; and Y ∈ R is the outcome coded so that higher values are better. Let X 0 ∈ R and X 1 ∈ R p be known features of The estimated optimal policy is π n (x) = sgn(x ⊺ 1 β 1 ); under standard regularity conditions √ n( β − β * ) is asymptotically normal. Assume A is randomly assigned with P (A = 1|X) = P (A = −1|X) = 1/2 with probability one, then under sufficient regularity conditions (Qian and Murphy, 2011;Zhao et al., 2012;Zhang et al., 2012) where the last term is not asymptotically normal and is highly sensitive to the distribution of X 1 and value of β * 1 . Note also that if can be made arbitrarily large (small); this underscores the need for a data-dependent bound on the approximation error.
A natural choice for Ψ is a sigmoid function though other choices are possible. Let V Ψ ( π n ) = 2P n Y Ψ(AX ⊺ 1 β 1 ); under mild moment assumptions a Taylor series argument shows that √ n( V Ψ ( π n ) − V Ψ ( π n )) is regular and asymptotically normal. The approximation error is bounded by We conjecture that the performance of this method will be sensitive to the choice of Ψ. If Ψ closely approximates the step function 1 u>0 then the approximation error bound b(ǫ) will be small but it may be difficult to form a high-quality estimator of the sampling distribution of √ n( V Ψ ( π n ) − V Ψ ( π n )) because the derivative of Ψ must be very large near the origin; on the other hand, if the derivative of Ψ is not large near the origin then it may be possible to obtain a high-quality estimator of the sampling distribution of √ n( V Ψ ( π n ) − V Ψ ( π n )) but the approximation error bound may be large.

Coherent confidence intervals
RR illustrate the perils of using separate confidence intervals for the decision boundary at each patient history rather than a single joint confidence interval for the entire decision boundary across all patient histories. In the RR example, there are three possible patient histories coded x ∈ {−1, 0, 1}, two treatments a ∈ {−1, 1}. Thus, there are eight regimes, each represented as a triple (a −1 , a 0 , a 1 ), where a x ∈ {−1, 1} denotes the treatment assigned to a subject with X = x. In the RR example, the postulated model class for the decision boundary excludes the regime (1, −1, 1) yet this regime could be implemented by a clinician who selects treatments using clinical judgment whenever a univariate confidence interval for the decision boundary at a given patient history contains zero. Thus, using separate confidence intervals for each patient history can lead to "incoherent" clinician behavior. However, instead of evaluating the performance of these confidence intervals by considering a population of clinicians, we argue that we should evaluate the performance across a population of patients. Thus, the question from our point of view is whether any patients are being treated inappropriately. A patient with history x = −1 is unaffected by what their treatment would have been had their history been 0 or 1. In the RR example suppose that all clinicians apply the "incoherent" regime (1, −1, 1). Patients with x = 1 receive the (estimated) optimal treatment a = 1; patients with x = −1 receive a = 1 which is not dominated by a = −1 and is thus appropriate; and similarly, patients with x = 0 receive a = −1 which is not dominated by a = 1 and thus appropriate.
However moving to the multi-stage setting we appreciate RR's point of view in regards to potentially incoherent sequences of treatments being assigned to a patient. Indeed some of us, Laber et al. (2014) define the set of feasible regimes as those that are consistent with a partial ordering induced on the treatments (e.g., a partial ordering induced by separate confidence intervals for each patient history) and representable in the postulated model space (they did not explore incoherence further). We and others are currently working on this problem.

Soft-max Q-learning
GSZK propose a soft-max Q-learning as an alternative to Q-learning. Because soft-max Q-learning involves only smooth operations of the data standard methods for inference apply. We show that soft-max Q-learning can be viewed as regular Q-learning applied to a stochastic policy in which the propensity of the clinician to follow the estimated optimal second stage policy varies with the estimated second stage effect size. Soft-max Q-learning uses predicted outcomeY = H ⊺ 2,0 β 2,0 + α −1 log(1 + exp{αH ⊺ 2,1 β 2,1 }). It can be shown that sup v∈R |α −1 log(1+exp{αv})−expit(αv)v| = o(1/α) as α → ∞, where expit(u) = exp(u)/(1 + exp(u)). Let D denote the observed data and let π α 2 (h 2 ) be a stochastic second stage policy that satisfies P ( π α 2 (h 2 ) = π 2 (h 2 )|H 2 = h 2 , D) = expit(αH ⊺ 2,1 β 2,1 ). A clinician acting according to π α 2 is increasingly more likely to recommend treatments consistently with the Q-learning estimated optimal second stage decision rule as the magnitude of the estimated second stage effect size, H ⊺ 2,1 β 2,1 , increases. Then, E( Q 2 (H 2 , π α 2 (H 2 ))|H 2 , D) = H ⊺ 2,0 β 2,0 + expit(H ⊺ 2,1 β 2,1 )H ⊺ 2,1 β 2,1 ≈Y . Thus, soft-max Q-learning can be viewed as estimating the optimal first stage decision rule assuming the clinician will follow a stochastic policy described by π α 2 at the second stage. Because soft-max Qlearning is smooth, it should possible to conduct the conditional inference for the first stage Q-function as suggested by RR; such conditional inference would therefore accommodate not only the data-driven decision rule at the second stage (as suggested by RR) but also uncertain clinician behavior.

Variable screening for SMARTs
Hsu and Small (HS hereafter) propose a method for identifying variables that may be of interest for follow-up investigation within the context of a SMART. In the setup considered by HS the observed data on each subject are (A, D, Y ) where: A ∈ {0, 1} is a randomized treatment; D ∈ {0, 1} is an intermediate (post-treatment) outcome; and Y ∈ R is a distal outcome. For simplicity, assume that Y is binary and coded to take values in {0, 1}. HS's idea is to use D as an indicator that a subject's initial treatment should have been switched; thus, in a future study upon observing D the clinician has the opportunity to switch from treatment a to treatment 1 − a. Let {(Y (a=j) , D (a=j) )} j=0,1 denote the set of potential outcomes. Define C k j = P (Y (a=k) = 1, Y (a=1−k) = 0|D (a=k) = j)− P (Y (a=k) = 0, Y (a=1−k) = 1|D (a=k) = j) for j, k ∈ {0, 1}. In this setting (1) in HS is equivalent to C 1 1 > C 1 0 . For clarity we describe this in words.
(1) only considers a population of individuals who start off on treatment, a = 1. C 1 1 > 0 means that among the subpopulation of individuals experiencing D = 1 in response to a = 1, a higher fraction have started off on their optimal treatment than have started off on their non-optimal treatment. Similarly C 1 0 > 0 means that among the subpopulation experiencing D = 0 in response to a = 1, a higher fraction have started off on their non-optimal treatment than have started off on their optimal treatment. Thus C 1 1 > C 1 0 is a comparison of differences in fractions, one difference per subpopulation. Among those who experience D = 1 the difference in fractions of individuals who started off on their optimal treatment versus did not start off on their optimal treatment is higher than the same fraction among those who experience D = 0.
At first it was unclear to us how (1) in HS's discussion might be used to inform treatment decisions. However suppose we are willing to assume no carryover effect. That is, if a patient switches treatment then the patient's long run outcome Y will only depend on the new treatment. In this case consider instead testing if C 1 j C 0 j < 0 for each j = 0, 1; if the test indicates that C 1 j C 0 j < 0, assuming that there is no carry-over effect, D could be used with future patients to dictate a treatment switch. If the estimatedĈ k j < 0 butĈ 1−k j > 0 then future patients initially treated with a = k experiencing intermediate outcome D = j might be switched to a = 1 − k. Thus, in this sense, the method proposed by HS can be used to inform treatment decisions.

Scaling estimation an inference to large problems
Banerjee (B hereafter) asked about computation and asked how the methods and technical issues scale to settings with high-dimensional or continuous actions. Computation of the bounds requires solving a non-convex optimization problem of the form where ω 1 , . . . , ω n , d 1 , . . . , d n , e 1 , . . . , e n ∈ R and r 1 , . . . , r n ∈ R p are known fixed constants. In our simulations the dimension of γ was sufficiently small so that we could compute an approximate solution using a stochastic search; the search was tuned to ensure a sufficient number of points were evaluated before termination. However, there are several other approaches that could be used to compute an approximate solution to (1). One approach that scales well to high-dimensional (large p) problems is coordinate descent. Note that (1) is continuous and piecewise linear. Suppose first that p = 1. In this case it can be seen that any optimal solution γ * must solve r i γ * + d i = 0 or r i γ * + e i = 0 for some 1 ≤ i ≤ n. Thus, there are at most 2n candidate solutions that must be examined to find an optimal solution in the 1d case. For any v ∈ R p let v (j) = (v 1 , . . . , v j−1 , v j+1 , . . . , v p ) and define Computing (2) is equivalent to solving (1) with p = 1 and thus requires evaluation of 2n potential solutions. The coordinate descent algorithm is as follows: (i) initialize γ to a starting value in R p ; (ii) repeat updates of the form γ j = γ j (γ (j) ) cycling over j = 1, . . . , p until changes in γ are sufficiently small. Coordinate descent is easily to implement and the evaluation of the 2n potential solutions required for each update is trivial to parallelize; we have had success with this approach in other contexts with similar non-differentiable but continuous objectives (Laber and Murphy, 2013). An alternative approach to computing (1) is to recognize that it can be written as which is the difference of two convex functions and thus (1) is a DC programming problem. Hence, one can apply DC algorithms (see Horst and Thoai, 1999, and references therein) to approximate (1). The DC algorithm has a track record of being effective in large problems but our experience with this algorithm is limited.
A related problem raised by B is a discrete but high-dimensional treatment. However, unlike the continuous treatment setting, with a high-dimensional discrete treatment there may be no notion of smoothness across values of treatment thereby making it difficult to pool information across units of observation. Without massive amounts of data additional structure must be introduced into the working models. One setting in which this occurs is when treatments are made at each time point across a series of spatial locations. Suppose for simplicity that one can apply one of two treatments, say 0 or 1 at each of N locations; in this setting a ∈ {0, 1} N . However, it may be reasonable to assume that a treatment applied at a given location diffuses quickly over space and that any heterogeneity in the treatment effect over space can be modeled parametrically using observable covariates at each location.

Definition of a direct estimator
RR correctly note that the estimator that we discuss in Appendix A should not be viewed as providing an estimator of the optimal DTR. Note that our asymptotic results describe the behavior of the parameter estimators about their limiting values and do not refer to an optimal DTR. Unfortunately, this was not clear in our description.
We believe that in some settings there are other, equally valid or better, criteria for assessing the quality of direct estimators than consistent estimation of the parameters indexing the estimated optimal DTR. In some settings, constructing a high-quality DTR from within a pre-specified class, e.g., restricted to be parsimonious, logistically feasible, low-cost, etc., may be more of a priority than constructing a consistent estimator of π opt (Orellana et al., 2010;Zhang et al., 2012Zhang et al., , 2013. In such cases, performance guarantees in the form of error bounds, i.e., bounds on the difference between the value of an estimated policy and the value of π opt , may be more appropriate. Error bounds have been used extensively in computer science (e.g., Bartlett et al., 2006, and references therein) where the focus is primarily on performance rather than making statements about the true structure of the optimal DTR. For example, Zhao et al. (2012) derive error bounds on a direct estimator of the form used in Appendix A.