iqLearn: Interactive Q-learning in R



Introduction
In practice, clinicians and intervention scientists must adapt treatment recommendations in response to the uniquely evolving health status of each patient. Dynamic treatment regimes (DTRs) formalize this treatment process as a sequence of decision rules, one for each treatment decision, which map current and past patient information to a recommended treatment. A DTR is said to be optimal for a pre-specified desirable outcome if, when applied to assign treatment to a population of interest, it yields the maximal expected outcome.
With the potential for better patient outcomes and reduced treatment burden and cost, there is growing interest in personalized treatment strategies (Hamburg and Collins, 2010; Abrahams and President, 2010). Sequential Multiple Assignment Randomized Trials (SMARTs; Lavori and Dawson, 2004; Murphy, 2005a) are designed for the estimation of optimal DTRs. In a SMART, subjects are randomized to treatment at each decision point, or stage, of the trial. Figure 1 contains a visual representation of a toy SMART example in which all subjects receive the same treatment at baseline (e.g., possibly a standard of care). After some period of time in the baseline stage, patients are randomized (represented by gold circles) at the start of the first stage to one of two treatment categories: "switch" or "augment" the current treatment. After some period of time in the first stage, subjects are again randomized to either switch or augment their current treatment(s) in the second stage. There are many variations of this design; for example, more than two treatments can be offered at each stage, and for ethical reasons it is common to include an option for baseline or first-stage responders to continue their currently successful treatment. Although it is possible to design a trial with additional stages, two-stage SMARTs are common, as evidenced by many recently completed and ongoing SMARTs. For a list of SMARTs that have finished or are in the field, see The Methodology Center at Pennsylvania State University (2012) and Eric Laber's current list (Laber, 2013). With additional randomizations beyond one or two stages, the number of patients assigned to each sequence of treatments decreases, along with the power to estimate optimal decisions in the later stages. In principle, the sequential randomization scheme in SMARTs guarantees that there are no confounders that influence which types of subjects follow each of the possible treatment sequences. To keep our discussion focused, we will work under the assumption of a two-stage SMART with randomized binary treatments at each stage. However, all the methods discussed here apply to observational data when additional assumptions are made on the treatment assignment mechanism (see, for example, Murphy, 2003; Moodie et al., 2012).
We introduce the package iqLearn (Linn et al., 2013) in R (R Development Core Team, 2004) for estimating optimal DTRs from data obtained from a two-stage trial with two treatments at each stage using Interactive Q-learning (IQ-learning; Laber et al., 2013). Although we recommend using IQ-learning instead of Q-learning in most practical settings, based on the developments in Section 2 and Laber et al. (2013), a comparison of the regimes estimated by the two methods may be of interest to some data analysts. Thus, functions for estimating a regime with the Q-learning algorithm are also included in iqLearn for completeness.
Introductions to both Q- and IQ-learning are provided in Section 2. Section 3 provides a case study illustrating the iqLearn package. A brief discussion of future work concludes the paper in Section 4.
2 Q-learning and Interactive Q-learning

We assume data are collected from a two-stage randomized trial with binary treatments at each stage, resulting in n i.i.d. patient trajectories of the form (X_1, A_1, X_2, A_2, Y). The variables in the trajectory are: baseline covariates, X_1 ∈ R^{p_1}; first-stage randomized treatment, A_1 ∈ {−1, 1}; covariates collected during the first stage but prior to second-stage treatment assignment, X_2 ∈ R^{p_2}; second-stage randomized treatment, A_2 ∈ {−1, 1}; and the response, Y ∈ R, collected at the conclusion of the trial. We assume Y has been coded so that higher values indicate more positive clinical outcomes. To simplify notation, we group the variables collected prior to each treatment randomization into a history vector H_t, t = 1, 2; that is, H_1 = X_1 and H_2 = (X_1, A_1, X_2). A DTR is a pair of functions π = (π_1, π_2), where π_t maps the domain of H_t into the space of available treatments {−1, 1}. Under π, a patient presenting at time t with history H_t = h_t is assigned treatment π_t(h_t). The goal is to estimate a DTR that, when applied to assign treatment in a population of interest, maximizes the expected outcome. Define the value of a fixed regime π as V^π = E^π(Y), where E^π denotes expectation when treatment is assigned according to the policy π. The optimal treatment regime, π^opt, maximizes the value function: π^opt = arg max_π V^π. In the next two sections, we explain how an optimal regime can be estimated from data using Q-learning and IQ-learning. The IQ-learning estimated optimal decision rules will be denoted by π_t^{IQ-opt} and the Q-learning analogs by π_t^{Q-opt}. Both methods are implemented in the iqLearn package.

Q-learning
Q-learning (Watkins, 1989; Watkins and Dayan, 1992; Murphy, 2005b) is an approximate dynamic programming algorithm that can be used to estimate an optimal DTR from observational or randomized study data. Define the Q-functions

Q_2(h_2, a_2) = E(Y | H_2 = h_2, A_2 = a_2),
Q_1(h_1, a_1) = E( max_{a_2} Q_2(H_2, a_2) | H_1 = h_1, A_1 = a_1 ).   (1)

The Q-function at stage two measures the Quality of assigning a_2 to a patient presenting with history h_2. Similarly, Q_1 measures the quality of assigning a_1 to a patient with history h_1, assuming an optimal decision rule will be followed at stage two. Were the Q-functions known, dynamic programming (Bellman, 1957) gives the optimal solution,

π_t^{opt}(h_t) = arg max_{a_t} Q_t(h_t, a_t).

Since the underlying distribution of the patient histories is not known, the conditional expectations that define the Q-functions are unknown and must be approximated. Q-learning approximates the Q-functions with regression models; linear models are commonly chosen in practice because they yield simple, interpretable models. We will consider linear models of the form

Q_t(h_t, a_t; β_t) = h_{t0}^⊺ β_{t0} + a_t h_{t1}^⊺ β_{t1},

where h_{t0} and h_{t1} include an intercept and a subset of the variables collected in h_t. Given an estimator β_2 of the second-stage coefficients, define

Ỹ = max_{a_2} Q_2(H_2, a_2; β_2) = H_{20}^⊺ β_{20} + |H_{21}^⊺ β_{21}|,

which is the predicted future outcome assuming the optimal decision is made at stage two. The Q-learning algorithm is given below.

Q-learning Algorithm:

Q1. Modeling: Regress Y on H_20, H_21, A_2 to obtain the second-stage estimator Q_2(h_2, a_2; β_2) = h_{20}^⊺ β_{20} + a_2 h_{21}^⊺ β_{21}.

Q2. Maximization: Compute Ỹ = max_{a_2} Q_2(H_2, a_2; β_2) = H_{20}^⊺ β_{20} + |H_{21}^⊺ β_{21}|.

Q3. Modeling: Regress Ỹ on H_10, H_11, A_1 to obtain the first-stage estimator Q_1(h_1, a_1; β_1) = h_{10}^⊺ β_{10} + a_1 h_{11}^⊺ β_{11}.

The t-th stage optimal decision rule then assigns the treatment a_t that maximizes the estimated stage-t Q-function, π_t^{Q-opt}(h_t) = arg max_{a_t} Q_t(h_t, a_t; β_t). In Q-learning with linear models, this can be written as π_t^{Q-opt}(h_t) = sign(h_{t1}^⊺ β_{t1}). The first modeling step in the Q-learning algorithm (Q1) is a standard multiple regression problem to which common model building and model checking techniques can be applied to find a parsimonious, well-fitting model. The absolute value in the definition of Ỹ arises when A_2 is coded as {−1, 1}, since arg max_{a_2} Q_2(H_2, a_2; β_2) = sign(H_{21}^⊺ β_{21}). The second modeling step (Q3) requires modeling the conditional expectation of Ỹ, which can be written as

E(Ỹ | H_1 = h_1, A_1 = a_1) = E(H_{20}^⊺ β_{20} + |H_{21}^⊺ β_{21}| | H_1 = h_1, A_1 = a_1).

Due to the absolute value function, Ỹ is a nonsmooth, nonmonotone transformation of H_2. Thus, the linear model in step Q3 is generally misspecified. In addition, the nonsmooth, nonmonotone max operator in step Q2 leads to difficult nonregular inference for the parameters that index the first-stage Q-function (Robins, 2004; Chakraborty et al., 2010; Laber et al., 2010; Song et al., 2011; Chakraborty et al., 2013). In the next section, we describe an alternative to Q-learning, which we call IQ-learning, that addresses the applied problem of building good models for the first-stage Q-function and avoids model misspecification for a large class of generative models.

Interactive Q-learning (IQ-learning)
IQ-learning differs from Q-learning in the order in which the maximization step (Q2 in the Q-learning algorithm) is performed. We demonstrate how the maximization step can be delayed, enabling all modeling to be performed before this nonsmooth, nonmonotone transformation. This reordering of the modeling and maximization steps facilitates the use of standard, interactive model building techniques because all terms to be modeled are linear, and hence smooth and monotone, transformations of the data. For a large class of generative models, IQ-learning estimates the first-stage Q-function more accurately than Q-learning, resulting in a higher-quality estimated decision rule (Laber et al., 2013). Another advantage of IQ-learning is that in many cases conditional mean and variance modeling techniques (Carroll and Ruppert, 1988) offer a convenient framework for the necessary modeling steps. These mean and variance models are interpretable, and the coefficients indexing them enjoy normal limit theory. Thus, they are better suited to inform clinical practice than the misspecified first-stage model in Q-learning, whose indexing parameters are nonregular. However, the mean-variance modeling approach we advocate here is not required, and other modeling techniques may be applied as needed. Indeed, a major advantage of and motivation for IQ-learning is the ability of the seasoned applied statistician to build high-quality models using standard interactive techniques for model diagnosis and validation.

IQ- and Q-learning do not differ at step one (Q1), which we refer to as the second-stage regression. Define the main effect function m(H_2; β_2) = H_{20}^⊺ β_{20} and the contrast function ∆(H_2; β_2) = H_{21}^⊺ β_{21}. The contrast function "contrasts" the quality of the two second-stage treatments: ∆(H_2; β_2) = {Q_2(H_2, 1; β_2) − Q_2(H_2, −1; β_2)}/2. In the IQ-learning framework, the first-stage Q-function is defined as

Q_1(h_1, a_1) = E( m(H_2; β_2) | H_1 = h_1, A_1 = a_1 ) + ∫ |z| g(z | h_1, a_1) dz,   (2)

where g(• | h_1, a_1) is the conditional density of the contrast function ∆(H_2; β_2) given H_1 = h_1 and A_1 = a_1. In fact, (2) is equivalent to the representation of Q_1 in (1); the conditional expectation has simply been split into two separate expectations, the second of which is written in integral form. Instead of modeling the conditional expectation in (1) directly, IQ-learning separately models E(m(H_2; β_2) | H_1 = h_1, A_1 = a_1) and g(• | h_1, a_1). Although IQ-learning trades one modeling step (Q3) for two, splitting up the conditional expectation in (1) is advantageous because the terms that require modeling are now smooth, monotone functionals of the data. The maximization occurs when the integral in (2) is computed, which happens after the conditional density g(• | h_1, a_1) has been estimated. The IQ-learning algorithm is given below.
IQ-learning Algorithm:

IQ1. Modeling: Regress Y on H_20, H_21, A_2 to obtain the second-stage estimator Q_2(h_2, a_2; β_2) = h_{20}^⊺ β_{20} + a_2 h_{21}^⊺ β_{21} (identical to Q1).

IQ2. Modeling: Obtain an estimator of the main effect function mean, E(m(H_2; β_2) | H_1 = h_1, A_1 = a_1).

IQ3. Modeling: Obtain an estimator g(• | h_1, a_1) of the conditional density of the contrast function ∆(H_2; β_2) given H_1 = h_1 and A_1 = a_1.

IQ4. Maximization: Combine the above estimators to form Q_1^{IQ}(h_1, a_1; β_2), the plug-in version of (2) obtained by substituting the estimated main effect mean from IQ2 and the estimated contrast density from IQ3.

The IQ-learning estimated optimal DTR assigns the treatment at stage t as the maximizer of the estimated stage-t Q-function, π_t^{IQ-opt}(h_t) = arg max_{a_t} Q_t^{IQ}(h_t, a_t; β_t). We note that it is possible to obtain Q_1^{IQ} in IQ4 by modeling the bivariate conditional distribution of m(H_2; β_2) and ∆(H_2; β_2) given H_1 and A_1 instead of performing the separate modeling steps IQ2 and IQ3. However, it is often easier to assess model fits using standard residual diagnostics and other well-established model checking tools when E(H_{20}^⊺ β_{20} | H_1, A_1) and g(• | H_1, A_1) are modeled separately.

Remark about density estimation in IQ3
Step IQ3 in the IQ-learning algorithm requires estimating a one-dimensional conditional density. In Laber et al. (2013) we accomplish this using mean-variance, location-scale estimators of g(• | h_1, a_1) of the form

g(z | h_1, a_1) = σ(h_1, a_1)^{-1} φ( {z − µ(h_1, a_1)} / σ(h_1, a_1) ),

where µ(h_1, a_1) and σ(h_1, a_1) are estimators of the conditional mean and standard deviation of ∆(H_2; β_2) given H_1 = h_1 and A_1 = a_1, and φ is an estimator of the density of the standardized residuals {∆(H_2; β_2) − µ(h_1, a_1)}/σ(h_1, a_1), say φ_{h_1,a_1}, which we assume does not depend on the history h_1 or the treatment a_1. Mean-variance function modeling tools are well-studied and applicable in many settings (Carroll and Ruppert, 1988). Currently, iqLearn implements mean-variance modeling steps to estimate g(• | h_1, a_1), with the option of using a standard normal density or an empirical distribution estimator for φ.
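To make the location-scale construction concrete, the following minimal sketch builds such a density estimator with generic R tools only; it does not use the iqLearn API, and the simulated data frame and the variable names contrast, x1, and a1 are hypothetical. The empirical option is illustrated here with a kernel density of the standardized residuals as a stand-in.

set.seed(1)
n   <- 200
dat <- data.frame(x1 = rnorm(n), a1 = sample(c(-1, 1), n, replace = TRUE))
dat$contrast <- 1 + 0.5 * dat$x1 * dat$a1 + rnorm(n, sd = exp(0.3 * dat$x1))

meanFit <- lm(contrast ~ x1 * a1, data = dat)   # conditional mean mu(h1, a1)
r       <- residuals(meanFit)
varFit  <- lm(log(r^2) ~ x1 + a1, data = dat)   # log-linear model for the squared residuals
sigma   <- sqrt(exp(fitted(varFit)))            # fitted conditional standard deviations
z       <- r / sigma                            # standardized residuals

gHat <- function(z0, newdat, dens = c("norm", "nonpar")) {
  dens <- match.arg(dens)
  mu  <- predict(meanFit, newdata = newdat)
  sd1 <- sqrt(exp(predict(varFit, newdata = newdat)))
  u   <- (z0 - mu) / sd1
  phi <- if (dens == "norm") dnorm(u) else {
    bw <- bw.nrd0(z)                            # kernel estimate of the residual density
    mean(dnorm((u - z) / bw)) / bw
  }
  phi / sd1                                     # location-scale density g(z0 | h1, a1)
}

gHat(0.5, data.frame(x1 = 0.2, a1 = 1), dens = "nonpar")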

3 Using the iqLearn Package

Preparing dataset bmiData
The examples in this section are illustrated using a simulated dataset called bmiData, which is included in the iqLearn package. The data are generated to mimic a two-stage SMART of body mass index (BMI) reduction with two treatments at each stage. The variables, treatments, and outcomes in bmiData are based on a small subset of the variables collected in a clinical trial studying the effect of meal replacements (MRs) on weight loss and BMI reduction in obese adolescents; see Berkowitz et al. (2010) for a complete description of the original randomized trial. Descriptions of the generated variables in bmiData are given in Table 1. Baseline covariates include gender, race, parent_BMI, and baseline_BMI.
Four- and twelve-month patient BMI measurements were also included to reflect the original trial design. In the generated data, treatment was randomized to meal replacement (MR) or conventional diet (CD) at both stages, each with probability 0.5. In the original study, patients randomized to CD in stage one remained on CD with probability one in stage two. Thus, our generated data arise from a slightly different design than that of the original trial. In addition, some patients in the original data set were missing the final twelve-month response as well as various first- and second-stage covariates. Our generated data are complete, and the illustration of IQ- and Q-learning with iqLearn that follows is presented under the assumption that missing data have been addressed prior to using these methods (for example, using an appropriate imputation strategy).
After installing iqLearn, load the package:

> library(iqLearn)
Next, load bmiData into the workspace with

> data(bmiData)
The generated dataset bmiData is a data frame with 210 rows corresponding to patients and 8 columns corresponding to covariates, BMI measurements, and assigned treatments.

> dim(bmiData)
[1] 210 8

> head(bmiData)

We use the negative percent change in BMI at month 12 from baseline as our final outcome,

Y = -100 × (twelve-month BMI − baseline BMI) / baseline BMI.

Thus, higher values indicate greater BMI loss, a desirable clinical outcome. We next show how to implement IQ-learning with the iqLearn package to obtain an estimate of the optimal DTR, π^{IQ-opt} = (π_1^{IQ-opt}, π_2^{IQ-opt}), that maximizes expected BMI reduction.
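A sketch of constructing this outcome in R is given below; it assumes the twelve-month and baseline BMI columns of bmiData are named month12_BMI and baseline_BMI, so check names(bmiData) or Table 1 for the exact names in your installed version.

> names(bmiData)    # verify the column names before computing y
> y = -100 * (bmiData$month12_BMI - bmiData$baseline_BMI) / bmiData$baseline_BMI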

IQ-learning functions
The current version of the iqLearn package only allows specification of linear models at all modeling steps. An advantage of IQ-learning over Q-learning is that, for a large class of generative models, linear models are correctly specified at each modeling step (Laber et al., 2013). In general, this is not true for Q-learning at the first stage. In our illustrations, we skip some of the typical exploratory techniques that a careful analyst would employ to find the best-fitting models. These steps would not be meaningful with the bmiData dataset, since it was simulated from linear working models, and they would only detract from our main focus, which is to present the steps of the IQ-learning algorithm using the functions in iqLearn.
Analysts who use IQ-learning should employ standard data exploration techniques between each modeling step. Another consequence of using generated data is that we will not interpret any coefficients or comment on model fit. In fact, most of the R^2 statistics are nearly 1 and many terms appear highly significant, reflecting the fact that the data are not real. All models and decision rules estimated in this section are strictly illustrative. In addition, the results in this section are not representative of the results of the original meal replacement study.

STEP IQ1: second-stage regression
The first step in the IQ-learning algorithm is to model the response as a function of second-stage history variables and treatment. We model the second-stage Q-function as a linear function of gender, parent_BMI, month4_BMI, and A2, fitting the model using least squares.

> fitIQ2 = learnIQ2(y ~ gender + parent_BMI + month4_BMI +
+     A2*(parent_BMI + month4_BMI), data=bmiData, treatName="A2",
+     intNames=c("parent_BMI", "month4_BMI"))

The function learnIQ2() creates an object of class learnIQ2 that contains an lm() object for the linear regression in addition to several other components. We have used the formula specification above. The user can specify any formula admissible by lm(), but it must include the main effect of treatment A2 and at least one treatment interaction term. The treatName and intNames arguments specify, respectively, which variable codes the second-stage treatment and which covariates interact with treatment. If exploratory work suggests there are no treatment-by-covariate interactions at the second stage, IQ-learning has no advantage over Q-learning, and it would be appropriate to model the conditional expectation of Y directly at the first stage. The default S3 method for learnIQ2() instead requires a matrix or data frame of variables to use as main effects in the linear model. Below, we create this data frame.
> s2vars = bmiData[, c(1, 3, 5)]

The default method also requires a vector of indices that point to the columns of s2vars that should be included as treatment interactions in the model.

> s2ints = c(2, 3)
The default method for learnIQ2() instead takes the data frame of main effect variables, the response, the second-stage treatment vector, and the interaction indices (a sketch of such a call is given below), after which the regression output can be printed by calling summary() on the learnIQ2 object.
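A sketch of the default-method call is shown below; the argument names Y, A2, and s2ints are assumptions based on the descriptions above, so consult ?learnIQ2 for the exact signature.

> ## assumed argument names; see ?learnIQ2 for the exact signature
> fitIQ2 = learnIQ2(H2=s2vars, Y=y, A2=bmiData$A2, s2ints=s2ints)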

> summary(fitIQ2)
Stage 2 Regression: (summary output omitted)

The plot() function can be used to obtain residual diagnostic plots from the linear regression, shown in Figure 2. These plots can be used to check the usual normality and constant variance assumptions. The learnIQ2 object returns a list that contains the estimated main effect coefficients,

> fitIQ2$betaHat20

and the estimated treatment interaction coefficients, $betaHat21. The first term of $betaHat20 is the intercept and the first term of $betaHat21 is the main effect of treatment A2. Other useful elements in the list include the vector of estimated optimal second-stage treatments for each patient in the dataset ($optA2), the lm() object ($s2Fit), the vector of estimated main effect terms ($main), and the vector of estimated contrast function terms ($contrast).

STEP IQ2: main effect function regression
The next step in the IQ-learning algorithm is to model the conditional expectation of the main effect term given first-stage history variables and treatment. We accomplish this by regressing the estimated main effect terms {H_{20,i}^⊺ β_{20}}, i = 1, ..., n, on a linear function of the first-stage histories and treatments {H_{1,i}, A_{1,i}} using the function learnIQ1main(), which creates an object of class learnIQ1main. The learnIQ1main() function extracts the estimated vector of main effect terms from the learnIQ2 object to use as the response variable in the regression. The user can specify any right-hand-sided formula admissible by lm(), but it must include the main effect of treatment A1. If no treatment interactions are desired, intNames can be omitted or specified as NULL (the default). The default S3 method for learnIQ1main() requires a matrix or data frame of variables to use as main effects in the linear model, which we create below, along with a vector of indices that point to the columns of s1vars that should be included as treatment interactions in the model. If no interactions are desired, s1mainInts can be omitted, as the default is NULL.
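The data frame of first-stage main effect variables, s1vars, can be constructed as below; the column selection is an assumption that the first four columns of bmiData are the baseline covariates gender, race, parent_BMI, and baseline_BMI.

> ## assumes the first four columns of bmiData are the baseline covariates
> s1vars = bmiData[, 1:4]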

> s1mainInts = c(1, 3)
The default method for learnIQ1main() parallels that of learnIQ2(), with the learnIQ2 object supplied as the first argument. Again, plot() gives residual diagnostic plots from the fitted regression model, shown in Figure 3. Elements of the list returned by learnIQ1main() include the estimated main effect and treatment interaction coefficients; other elements are used in later steps of the algorithm.
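A sketch of the default-method call is shown below; the argument names are assumptions made by analogy with learnIQ2(), so see ?learnIQ1main for the exact signature.

> ## assumed argument names; the first argument is the fitted learnIQ2 object
> fitIQ1main = learnIQ1main(object=fitIQ2, H1Main=s1vars, A1=bmiData$A1,
+     s1mainInts=s1mainInts)
> plot(fitIQ1main)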

STEP IQ3: contrast function density modeling
The final modeling step in IQ-learning is to model the conditional density of the contrast function given first-stage history variables and treatment. We accomplish this by considering the class of location-scale density models and employing standard conditional mean and variance modeling techniques. Thus, we begin by modeling the conditional mean of the contrast function using learnIQ1cm(). The user can specify any right-hand-sided formula admissible by lm(), but it must include the main effect of treatment A1. The default S3 method for learnIQ1cm() requires a matrix or data frame of variables to use as main effects in the linear model and indices indicating the treatment interaction effects. intNames can be omitted or specified as NULL if no interactions are desired. We will use s1vars and specify the interactions with the vector s1cmInts.

> s1cmInts = c(1, 3, 4)
The default method for learnIQ1cm() again takes the learnIQ2 object as its first argument, along with the main effect variables and interaction indices. Residual diagnostic plots from the fitted conditional mean model are shown in Figure 4. Other items in the returned list are used in upcoming steps of the algorithm.
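A sketch of the default-method call is shown below; the argument names are assumptions mirroring the main effect step, so see ?learnIQ1cm for the exact signature.

> ## assumed argument names; the first argument is the fitted learnIQ2 object
> fitIQ1cm = learnIQ1cm(object=fitIQ2, H1CMean=s1vars, A1=bmiData$A1,
+     s1cmInts=s1cmInts)
> plot(fitIQ1cm)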

After fitting the model for the conditional mean of the contrast function, we must specify a model for the variance of the residuals. Standard approaches can be used to determine whether a constant variance fit is sufficient. If so, the sample standard deviation of the contrast mean fit residuals is the default estimator of the common standard deviation,

> fitIQ1var = learnIQ1var(object=fitIQ1cm)

Equivalently, method='homo' can be specified to indicate homoskedastic variance,

> fitIQ1var = learnIQ1var(object=fitIQ1cm, method="homo")

but this additional argument is unnecessary since it is the default. A list is returned with the estimated common standard deviation of the contrast mean fit residuals ($stdDev), the vector of standardized residuals for each patient in the dataset ($stdResids), and several other elements, some of which are NULL when method='homo'.
If the variance is thought to be non-constant across histories H_1 and/or treatment A_1, the option method='hetero' allows specification of a log-linear model for the squared residuals. As before, the formula should be right-hand sided only and must include the main effect of treatment A1. The default for s1varInts is NULL, which can be used if no interactions are desired in the model. The formula version and the alternate default specification are sketched below and are similar to previous steps.
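The sketches below are illustrative only: the formula, the choice of s1varInts, and the argument names (treatName, intNames, H1Var, s1varInts) are assumptions patterned on the earlier modeling steps, so check ?learnIQ1var before use.

> ## formula version (assumed interface, mirroring learnIQ2())
> fitIQ1var = learnIQ1var(fitIQ1cm, ~ parent_BMI + baseline_BMI + A1*parent_BMI,
+     data=bmiData, treatName="A1", intNames="parent_BMI", method="hetero")
> ## default version (assumed argument names; interaction columns are hypothetical)
> s1varInts = c(3, 4)
> fitIQ1var = learnIQ1var(object=fitIQ1cm, H1Var=s1vars, A1=bmiData$A1,
+     s1varInts=s1varInts, method="hetero")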
Other elements in the returned list are used in the next IQ-learning step. The final step in the conditional density modeling process is to choose between the normal and empirical density estimators. Based on empirical experiments (see Laber et al., 2013), we recommend choosing the empirical estimator by default, as not much is lost when the true density is normal. However, iqResids() can be used to inform the choice of density estimator. The object of class iqResids can be plotted to obtain a normal QQ-plot of the standardized residuals, displayed in Figure 6. If the observations deviate from the line, dens='nonpar' should be used in the final IQ-learning step, IQ4.
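A sketch of this residual check and of the final combining step (IQ4) is shown below; the function name learnIQ1 and all argument names are assumptions based on the package's naming pattern, so confirm them in the package documentation.

> ## residual check (normal QQ-plot, Figure 6); argument names assumed
> fitResids = iqResids(fitIQ1cm, fitIQ1var)
> plot(fitResids)
> ## final combining step (IQ4); function and argument names are assumptions
> fitIQ1 = learnIQ1(mainObj=fitIQ1main, cmObj=fitIQ1cm, sigObj=fitIQ1var,
+     dens="nonpar")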

A vector of estimated optimal first-stage treatments for patients in the study is returned ($optA1).
Recommend treatment with IQ1() and IQ2()

After estimating the optimal regime using the IQ-learning algorithm, the functions IQ1() and IQ2() can be used to recommend treatment for future patients. To determine the recommended first-stage treatment for a patient with observed history h_1, we must form vectors h1main, h1cm, and h1var that match the order of main effects in each of the corresponding first-stage modeling steps. We suggest checking summary() for each of the first-stage modeling objects to ensure the new patient's history vectors have the correct variable ordering. If the 'homo' option was used to fit a constant variance, h1var can be left unspecified or set to NULL. In our examples, the main effects used in each of the three first-stage modeling steps all happen to be the same variables in the same order; thus, in this example h1main, h1cm, and h1var are identical.
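A sketch of recommending a first-stage treatment for a new patient is shown below; the history values and the IQ1() argument names are hypothetical and should be checked against the package documentation.

> ## hypothetical new patient history, ordered as in the first-stage model
> ## summaries: (gender, race, parent_BMI, baseline_BMI)
> h1main = c(1, 0, 30, 35)
> h1cm = h1main
> h1var = h1main
> ## assumed argument names for IQ1()
> optIQ1 = IQ1(mainObj=fitIQ1main, cmObj=fitIQ1cm, sigObj=fitIQ1var,
+     dens="nonpar", h1main=h1main, h1cm=h1cm, h1sd=h1var)
> optIQ1$q1opt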
A list is returned by IQ1() that includes the value of the first-stage Q-function when A_1 = 1 ($q1Pos) and A_1 = −1 ($q1Neg), as well as the recommended first-stage treatment for that patient, $q1opt.
For a patient with second-stage history h_2, we only need to check the order of the main effects in the second-stage regression and form a corresponding vector based on the new patient's observed history, as sketched below.
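A hypothetical second-stage recommendation is shown below; the history values and the IQ2() argument names are assumptions (see ?IQ2).

> ## hypothetical new patient history, ordered as in summary(fitIQ2):
> ## (gender, parent_BMI, month4_BMI)
> h2 = c(1, 30, 33)
> ## assumed argument names for IQ2()
> optIQ2 = IQ2(object=fitIQ2, h2=h2)
> optIQ2$q2opt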
Similar to IQ1(), a list is returned that contains the value of the second-stage Q-function when A_2 = 1 ($q2Pos) and A_2 = −1 ($q2Neg), as well as the recommended second-stage treatment, $q2opt.

Q-learning functions
For convenience, when a comparison of IQ- and Q-learning is desired, functions are available in iqLearn to estimate and recommend optimal treatment strategies using Q-learning. The function qLearnS2() implements the second-stage regression in the same manner as learnIQ2(), with the minor exception that a treatment-by-covariate interaction is not required; only the main effect of treatment A2 must be included. Examples of the default and formula implementations are given below. In addition, Ỹ can be accessed from the qLearnS2 object with $Ytilde, and the lm() objects at each stage are also included ($s2Fit and $s1Fit). Finally, the qLearnS1 object contains a vector of estimated optimal first-stage treatments for patients in the dataset ($optA1), and the qLearnS2 object contains the corresponding second-stage vector ($optA2).

> fitQ2 = qLearnS2(H2=s2vars, Y=y, A2=bmiData$A2, s2ints=s2ints)
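Sketches of the formula implementation and of the first-stage fit with qLearnS1() follow; the argument names are assumptions patterned on learnIQ2() and on the description of qLearnS1() later in the text.

> ## formula version (assumed interface, mirroring learnIQ2())
> fitQ2 = qLearnS2(y ~ gender + parent_BMI + month4_BMI +
+     A2*(parent_BMI + month4_BMI), data=bmiData, treatName="A2",
+     intNames=c("parent_BMI", "month4_BMI"))
> ## first-stage regression of Ytilde on first-stage history and treatment
> ## (argument names assumed)
> fitQ1 = qLearnS1(object=fitQ2, H1q=s1vars, A1=bmiData$A1, s1ints=c(1, 3))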
To recommend the Q-learning estimated optimal treatments for a new patient based on observed histories, the functions qLearnQ1() and qLearnQ2() are available and are similar to IQ1() and IQ2(). They require the observed history vectors for the new patient to have the same variables in the same order as the main effects in the regressions used to build the Q-learning regime. Checking the summary() of the Q-learning objects is recommended to ensure the histories are set up properly. Examples are sketched below. Elements of the returned lists are the same as those returned by IQ1() and IQ2().
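Hypothetical calls are shown below; the history values and argument names are assumptions (see ?qLearnQ1 and ?qLearnQ2).

> ## assumed argument names; histories ordered as in the Q-learning model summaries
> h1q = c(1, 0, 30, 35)
> h2q = c(1, 30, 33)
> optQ1 = qLearnQ1(object=fitQ1, h1q=h1q)
> optQ2 = qLearnQ2(object=fitQ2, h2q=h2q)
> optQ1$q1opt
> optQ2$q2opt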

Estimating Regime Value
We may wish to compare our estimated optimal regime to a standard of care or to a constant regime that recommends one treatment for all patients. One way to compare regimes is to estimate the value function. A plug-in estimator of V^π is

V_n^π = Σ_{i=1}^n Y_i 1{A_{1i} = π_1(h_{1i}), A_{2i} = π_2(h_{2i})} / Σ_{i=1}^n 1{A_{1i} = π_1(h_{1i}), A_{2i} = π_2(h_{2i})},

where Y_i is the i-th patient's response, (A_{1i}, A_{2i}) are the randomized treatments, and (h_{1i}, h_{2i}) are the observed histories. This estimator is a weighted average of the outcomes observed from patients in the trial who received treatment in accordance with the regime π. It is more commonly known as the Horvitz-Thompson estimator (Horvitz and Thompson, 1952). The function value() estimates the value of a regime using this plug-in estimator and also returns value estimates corresponding to four non-dynamic regimes: $valPosPos (π_1 = 1, π_2 = 1); $valPosNeg (π_1 = 1, π_2 = −1); $valNegPos (π_1 = −1, π_2 = 1); and $valNegNeg (π_1 = −1, π_2 = −1). value() takes as input d1, a vector of first-stage treatments assigned by the regime of interest; d2, a vector of second-stage treatments assigned by the regime of interest; Y, the response vector; A1, the vector of first-stage randomized treatments received by patients in the trial; and A2, the vector of second-stage randomized treatments.
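A sketch of estimating the value of the IQ-learning regime is shown below; the argument names follow the description above, while the fitted-object names (fitIQ1, fitIQ2) follow the earlier sketches and are assumptions.

> ## value of the estimated IQ-learning regime (uses objects from the sketches above)
> estVal = value(d1=fitIQ1$optA1, d2=fitIQ2$optA2, Y=y, A1=bmiData$A1, A2=bmiData$A2)
> estVal              # inspect the estimated value and the four constant-regime values
> estVal$valPosPos    # value of the constant regime (pi_1 = 1, pi_2 = 1), for comparison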

Conclusion
We have demonstrated how to estimate an optimal two-stage DTR using the IQ-learning and Q-learning functions and tools in the R package iqLearn. As indicated by its name, Interactive Q-learning allows the analyst to interact with the data at each step of the IQ-learning process to build models that fit the data well and are interpretable. At each model building step, the IQ-learning functions in iqLearn encourage the use of standard statistical methods for exploratory analysis, model selection, and model diagnostics.
Future versions of iqLearn will implement more general model options, in particular, the ability to handle data with more than two treatments at each stage.

Figure 1: SMART design toy example with two randomized stages and two treatment options at each stage. Patients progress from left to right and are randomized to one of two treatment options just prior to Stages 1 and 2. Randomizations are represented by gold circles; treatments are displayed in blue boxes.

Figure 2: Residual diagnostic plots from the second-stage regression in IQ-learning.

Figure 4: Residual diagnostic plots from the linear regression model for the contrast function mean.

Figure 5: Residual diagnostic plots from the log-linear variance model.

Figure 6: Normal QQ-plot of the standardized residuals obtained from the contrast mean and variance modeling steps.

Table 1: Description of variables in bmiData.
Methods summary() and plot() can be used in the same way as in the IQ-learning section; see the discussion of learnIQ2() for more details and examples. The function that estimates the first-stage Q-function is qLearnS1(). It can be implemented with either a right-hand-sided formula specification or the default method, both of which parallel the second-stage call. It is necessary to include the main effect of treatment A1, but s1ints (intNames in the formula version) can be omitted or specified as NULL if no interactions are desired in the model. Both qLearnS2 and qLearnS1 objects hold lists that include the estimated parameter vectors for the main effects and treatment interactions.