D-learning to estimate optimal individual treatment rules

Abstract: Recent exploration of the optimal individual treatment rule (ITR) for patients has attracted considerable attention due to the potentially heterogeneous responses of patients to different treatments. An optimal ITR is a decision function, based on patients' characteristics, that assigns the treatment maximizing the expected clinical outcome. The current literature mainly focuses on two types of methods: model-based and classification-based. Model-based methods rely on estimating the conditional mean of the outcome instead of directly targeting the decision boundary of the optimal ITR. As a result, they may yield suboptimal decisions. In contrast, although classification-based methods directly target the optimal ITR by converting the problem into weighted classification, they rely on correct weights for all subjects and may suffer from model misspecification. To overcome the potential drawbacks of these methods, we propose a simple and flexible one-step method to directly learn (D-learning) the optimal ITR without model or weight specification. Multi-category D-learning is also proposed for the case with multiple treatments. A new effect measure is proposed to quantify the relative strength of a treatment for a patient. We show estimation consistency and establish tight finite sample error bounds for the proposed D-learning. Numerical studies, including simulated and real data examples, demonstrate the competitive performance of D-learning.


Introduction
Precision medicine has recently gained increasing attention in scientific research. The goal of precision medicine is to identify the optimal individual treatment rule (ITR), taking into account patient heterogeneity such as demographics, background, and genetic information, so as to maximize each patient's expected clinical outcome. Mathematically speaking, an ITR is a function mapping from the covariate space into the treatment space.
There is a fast growing literature on estimating ITRs based on observational studies or randomized clinical trials. Existing approaches can be categorized into two main types, model-based and classification-based methods. Q-learning ([33], [23], [26], [24]) and A-learning ([22], [25]) are two representative model-based methods in precision medicine. Q-learning models the conditional mean of clinical outcomes given the patients' covariates and treatment, while A-learning, which can be more robust to model misspecification than Q-learning, models the contrast function of the outcome between treatments. Recently, [35] proposed a doubly robust augmented inverse propensity score weighted (IPSW) estimator for the ITR. [30] proposed a modified covariates regression method to estimate the ITR, and [9] proposed a new concordance-assisted learning method to estimate the optimal ITR.
As an interesting alternative approach, the classification-based method was first proposed by [37]. They showed that maximizing the individual clinical outcome is equivalent to minimizing a weighted classification error, and they proposed the outcome weighted learning (OWL) method, which uses clinical outcomes as the classification weights. To further improve the finite sample performance of OWL, [20] proposed an augmented OWL method, and [39] considered a residual weighted learning (RWL) method. Tree-based methods ([16, 7]) under the classification framework were considered to enhance the interpretability of ITRs.
In precision medicine, variables that have qualitative interactions with the treatment are called prescriptive variables ([12]). Correctly identifying prescriptive variables can save time and the cost of collecting unnecessary information in clinical practice. Although modern variable selection techniques have been used in model-based methods, they mainly focus on variables useful for prediction and may neglect prescriptive variables that have weak predictive power but are important for decision making. This can cause a mismatch between predicting clinical outcomes and optimizing ITRs for model-based methods. [12] and [8] proposed methods with an additional step to bridge the gap between prescriptive and prognostic variables.
For classification-based methods, OWL effectively formulated the problem as weighted classification ([37]). Despite its success, OWL can be affected by a constant shift of the clinical outcome and tends to keep the treatment assignments that patients actually received in randomized trials ([39]). RWL by [39] uses a model-based step to compute the weights, improving the finite sample performance of OWL, and also incorporates a variable selection procedure. Although RWL alleviates some potential issues of OWL, it may suffer from misspecification of the main effect model used to calculate the residuals that serve as classification weights. In addition, the computational cost of RWL can be high due to the use of the non-convex ramp loss function, especially when the dimension is large. Recently, [27] proposed a sparse OWL under the classification framework for variable selection. Despite these existing methods, more developments are needed for effective ITR estimation.
In this paper, we propose a novel one-step method to directly learn (D-learning) the optimal ITR without specifying the main effect model or the weights, for both binary and multiple treatment settings, while simultaneously performing variable selection on prescriptive variables for linear models. Extensions to nonlinear models via kernel regression are discussed as well. The proposed D-learning is very simple and flexible, and it combines the advantages of both model-based and classification-based methods. Furthermore, we propose a new measure to quantify the relative strength of all treatments. Such a measure provides doctors and patients with additional information among the treatments, beyond the ITR itself, to make better decisions. We also present comprehensive theoretical results for D-learning under both linear and nonlinear models.
The remainder of this paper is organized as follows. In Section 2, we briefly review some existing methods and introduce D-learning for estimating the optimal ITR. In Section 3, we establish estimation consistency and convergence rates of D-learning under various settings. In Section 4, we conduct an extensive simulation study to evaluate D-learning by comparing it with several alternative methods. In Section 5, we analyze acquired immune deficiency syndrome (AIDS) randomized clinical trial data ([13]) using D-learning and compare it with several alternative methods. We conclude the paper with some discussion in Section 6.

Direct learning for individual treatment rules
For notation, we use boldface capital and lowercase symbols to denote matrices and vectors, respectively, with the exception of the random vector X defined below. We first consider the framework of a binary treatment randomized trial. For each patient, we observe the treatment assignment A ∈ A = {1, −1}, the baseline information X = (1, X_1, ⋯, X_p)^T ∈ X, and the clinical outcome R after receiving the treatment. Without loss of generality, we assume that the larger R is, the better condition a patient is in. The treatment rules are the set of deterministic decision functions that map the patient's covariate space into the treatment space. We define π(a, x) = P[A = a | X = x] to be the probability of a patient being assigned to treatment a conditional on the covariates X under the randomized trial framework. For observational studies, π(a, x) denotes the propensity score and can be estimated via various methods such as logistic regression. We assume π(a, x) > 0 for any a ∈ A, given X ∈ X, almost surely. For simplicity, we assume π(a, x) is known in our discussion.
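For observational studies, the propensity model mentioned above can be fitted by standard logistic regression. A minimal numpy-only sketch (the function name, data-generating choices, and gradient-ascent fitting are ours, for illustration only):

```python
import numpy as np

def fit_propensity(X, A, iters=200, lr=0.5):
    """Estimate pi(1, x) = P[A = 1 | X = x] by logistic regression,
    fitted here with plain gradient ascent on the average log-likelihood.
    (Illustrative helper, not part of the paper's method.)"""
    Z = np.column_stack([np.ones(len(X)), X])   # design with intercept
    y = (A == 1).astype(float)
    beta = np.zeros(Z.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Z @ beta))
        beta += lr * Z.T @ (y - p) / len(y)     # gradient of avg log-likelihood
    return 1.0 / (1.0 + np.exp(-Z @ beta))

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (5000, 1))
true_p = 1.0 / (1.0 + np.exp(-X[:, 0]))         # treatment more likely for larger x
A = np.where(rng.uniform(size=5000) < true_p, 1, -1)
p_hat = fit_propensity(X, A)
print(np.abs(p_hat - true_p).mean())            # small average error
```

In a randomized trial this step is unnecessary, since π(a, x) is known by design.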
An optimal ITR is defined as the decision rule that maximizes the expected clinical outcome among all candidate rules. According to [24], the expected clinical outcome under the rule d can be written as

V(d) = E[ R I(A = d(X)) / π(A, X) ],    (2.1)

where I(·) is the indicator function. This quantity is called the value function associated with the treatment rule d. The optimal rule within a specific class of treatment rules D is then

d* = argmax_{d ∈ D} V(d).    (2.2)

Before introducing D-learning, we highlight three previous methods for comparison. [24] proposed l1-PLS as one of the fundamental model-based methods. It is a two-stage procedure for estimating the optimal ITR. The first stage computes the conditional mean of the clinical outcome, E[R | X, A], given covariates and treatment by l1-penalized regression. The second stage compares the estimated conditional means under the different treatments to derive the estimated ITR. The l1-PLS method is effective in that it uses a rich class of functions to approximate the conditional mean of R, and it is interpretable because it includes a variable selection procedure. However, l1-PLS does not directly target the prescriptive variables, and its performance guarantee relies on correct model specification. The implicit goal of l1-PLS is the prediction accuracy of the conditional clinical response, which may be suboptimal for optimizing the decision rule.
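The value function (2.1) can be estimated from a sample by inverse probability weighting; the sketch below (function and variable names are ours) uses the normalized form that the tuning criterion (2.16) later adopts:

```python
import numpy as np

def empirical_value(R, A, dX, prop):
    """IPW estimate of the value function V(d): each observed outcome is
    weighted by 1/pi(A, X) when the rule agrees with the received treatment."""
    agree = (A == dX).astype(float)
    w = agree / prop
    return np.sum(w * R) / np.sum(w)   # normalized IPW estimator

# Toy check: under a rule that always picks treatment 1 in a balanced
# trial (pi = 0.5), the estimate is the mean outcome in the A = 1 arm.
R = np.array([1.0, 2.0, 3.0, 4.0])
A = np.array([1, -1, 1, -1])
d = np.ones(4)
prop = np.full(4, 0.5)
print(empirical_value(R, A, d, prop))   # 2.0 = mean of {1, 3}
print(empirical_value(R, A, -d, prop))  # 3.0 = mean of {2, 4}
```

The same estimator works for observational data once the propensities are estimated.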

Previous related methods
To avoid modeling R directly, [37] proposed the interesting OWL method to maximize (2.1) under a weighted classification framework by showing that maximizing V(d) over rules of the form d(X) = sign(f(X)) is equivalent to minimizing the weighted classification error

E[ R I(A ≠ sign(f(X))) / π(A, X) ].    (2.3)

They use the convex hinge loss of the Support Vector Machine ([5]) to substitute the 0-1 loss function in (2.3). In particular, OWL finds an ITR by minimizing

(1/n) Σ_{i=1}^n [ R_i / π(A_i, X_i) ] (1 − A_i f(X_i))_+ + λ ||f||²,

where (·)_+ = max(·, 0) and ||f|| is some norm of the function f. Although OWL directly targets the prescriptive variables and can therefore make better decisions, it requires R to be positive, and the resulting treatment rule tends to keep the treatment assignments that the patients actually received in the randomized trial ([39]). [39] recently proposed RWL to improve the finite sample performance of OWL through a two-step procedure. The first step calculates residuals r_i, used as weights, by regressing R on X. After calculating the weights, which can be negative, they use a non-convex ramp loss function T to compute the optimal ITR under the classification framework. In particular, RWL aims to minimize

(1/n) Σ_{i=1}^n [ r_i / π(A_i, X_i) ] T(A_i f(X_i)) + λ ||f||².

An effective difference-of-convex algorithm was applied in RWL. However, RWL may require relatively large computational costs, especially in high dimensional or nonlinear decision boundary settings. More importantly, RWL relies heavily on correct calculation of the residuals and may fail if the model used for residual calculation in the first step is misspecified. Furthermore, additional variability may be introduced by the two-step procedure. Both OWL and RWL were designed for binary treatment settings. In Section 2.2, we propose an effective method that combines the strengths of both model-based and classification-based methods.

D-learning
Our goal is to estimate the optimal ITR by a single step and reduce the risk of model misspecification. In particular, model-based methods impose certain regression model assumptions to fit E[R|X, A]. For the classification based method RWL, calculating the residual r i is needed before the weighted classification step. One of the main motivations of D-learning is to avoid such model assumptions.
As shown in both [24] and [37], we can rewrite the optimal ITR in (2.2) as

d*(X) = sign(f_0(X)),  where  f_0(X) = E[R | X, A = 1] − E[R | X, A = −1].    (2.4)

Note that the optimal decision function f_0(X) can be further written as

f_0(X) = E[ RA / π(A, X) | X ].    (2.5)

We observe that (2.5) gives us a direct way to learn the optimal ITR by estimating the conditional expectation E[RA/π(A, X) | X]. The estimated ITR is then based on the sign of the estimated E[RA/π(A, X) | X]. In contrast to OWL, which requires all rewards to be nonnegative, f_0(X) is invariant to a constant shift of the reward R. If R is replaced by R + c(X) for any random variable c(X) depending only on X, then

E[ (R + c(X)) A / π(A, X) | X ] = f_0(X).    (2.6)

Using the result in (2.5), our D-learning estimates f_0(X) and uses sign(f_0(X)) as the estimated ITR. To further understand D-learning, we point out an interesting interpretation. Assume that the clinical outcome R can be expressed as

R = m(X) + A δ(X) + W,    (2.7)

where W is a mean zero random error term and m(X) is the main effect of the covariates X, common to both treatments. Then one can see that f_0(X) = E[RA/π(A, X) | X] = 2δ(X), which captures the heterogeneity between the two treatment effects without the need to estimate m(X). The general form of R for a multiple treatment setting can be represented as

R = m(X) + Σ_{k=1}^K I(A = k) δ_k(X) + W.

The next lemma provides a way to estimate f_0(X).
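A quick Monte Carlo check of (2.5)-(2.7): generating R = m(X) + Aδ(X) + W in a balanced trial, the transformed response RA/π recovers 2δ(X) in conditional mean, and a covariate-dependent shift c(X) of the reward leaves it unchanged, as in (2.6). All constants and functional forms below are illustrative choices of ours:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
X = rng.uniform(-1, 1, n)
A = rng.choice([-1, 1], n)                 # balanced trial: pi(a, x) = 1/2
delta = 0.7 * X                            # interaction effect delta(X)
R = (1.0 + X**2) + A * delta + rng.normal(0.0, 1.0, n)   # model (2.7)

Y = R * A / 0.5                            # transformed response RA/pi
sel = np.abs(X - 0.5) < 0.05               # condition on X near 0.5: 2*delta = 0.7
print(Y[sel].mean())                       # approximately 0.7

Y_shift = (R + 2.0 + X) * A / 0.5          # reward shifted by c(X) = 2 + X
print(Y_shift[sel].mean())                 # still approximately 0.7
```

The main effect m(X) = 1 + X² never needs to be estimated: it cancels in E[RA/π | X] because E[A | X] = 0.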

Lemma 1. Under the assumption that differentiation and expectation can be interchanged, f_0(X) = argmin_f E[ ( RA/π(A, X) − f(X) )² ], where the minimization is over measurable functions f.

Lemma 1 offers a simple way to estimate f_0(X): many existing regression methods can be adopted for this purpose. In the next two subsections, we consider both linear and nonlinear D-learning to estimate the ITR.

D-learning for linear decision rules
We first consider the setting of a two-arm randomized trial. Assume we observe independent and identically distributed triplets {(A_i, X_i, R_i); i = 1, ⋯, n}. If we consider F := {f(X) = X^T β, β ∈ R^p} as the class of linear functions to approximate f_0(X), then by Lemma 1 the ordinary least squares (OLS) estimator is given by

β̂_n = argmin_β (1/n) Σ_{i=1}^n ( R_i A_i / π(A_i, X_i) − X_i^T β )².    (2.8)

However, in high dimensional settings, especially when p > n, OLS may fail to estimate the parameter vector β and risks over-fitting. It is therefore desirable to use sparse regularization to improve the prediction accuracy and interpretability of the model. In our case, this not only helps identify the crucial prescriptive variables but also enhances the estimation accuracy of the decision boundary. Among the many penalized linear regression techniques in the literature, the Least Absolute Shrinkage and Selection Operator (LASSO) is one of the most popular variable selection methods in high dimensional regression, and the corresponding estimator is

β̂_n^lasso = argmin_β (1/n) Σ_{i=1}^n ( R_i A_i / π(A_i, X_i) − X_i^T β )² + λ ||β||_1,    (2.9)

where λ is the tuning parameter for the l1 penalty. The resulting ITR is then

d̂_n(X) = sign(f̂_{0(n)}(X)) = sign(X^T β̂_n^lasso).    (2.10)

Besides the LASSO, one could also use other convex penalties such as the elastic net ([40]), or non-convex penalties such as SCAD ([10]) and MCP ([36]).
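Linear D-learning can be sketched in a few lines: regress the transformed response R·A/π on X by least squares, as in (2.8), and read the ITR off the sign of the fit. The data-generating choices below (δ(X) = x1 − x2, main effect 1 + x3) are ours, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 5000, 5
X = rng.uniform(-1, 1, (n, p))
A = rng.choice([-1, 1], n)                     # pi(a, x) = 1/2
delta = X[:, 0] - X[:, 1]                      # prescriptive variables: x1, x2
R = (1.0 + X[:, 2]) + A * delta + rng.normal(0.0, 1.0, n)

Y = R * A / 0.5                                # E[Y | X] = 2 * delta(X)
Z = np.column_stack([np.ones(n), X])           # design with intercept
beta = np.linalg.lstsq(Z, Y, rcond=None)[0]
print(np.round(beta, 2))                       # approx [0, 2, -2, 0, 0, 0]

d_hat = np.sign(Z @ beta)                      # estimated ITR
d_opt = np.sign(delta)                         # true optimal ITR
print((d_hat == d_opt).mean())                 # high agreement
```

For p > n one would replace the plain least squares fit with the l1-penalized fit (2.9), e.g. via glmnet in R.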

D-learning for nonlinear decision rules
While linear D-learning is simple and interpretable, more complex nonlinear decision rules may sometimes be necessary in practice. We therefore consider two nonlinear methods to estimate f_0(X): Kernel Ridge Regression (KRR) ([6]) and the Component Selection and Smoothing Operator (COSSO) ([19]). KRR combines the kernel trick in machine learning with ridge regression, estimating the decision function by minimizing

min_{f ∈ H_K} (1/n) Σ_{i=1}^n ( R_i A_i / π(A_i, X_i) − f(X_i) )² + λ ||f||²_{H_K},    (2.11)

where H_K denotes the Reproducing Kernel Hilbert Space (RKHS) and ||·||_{H_K} is the corresponding norm. By the representer theorem ([15]), we can express the fitted function as f(X_i) = Σ_{j=1}^n c_j K(X_i, X_j), where K denotes the n × n kernel matrix and K(X_i, X_j) denotes its (i, j)-entry. The RKHS norm can be evaluated as ||f||²_{H_K} = c^T K c, so the optimization is equivalent to optimizing over c ∈ R^n via

min_{c ∈ R^n} (1/n) Σ_{i=1}^n ( R_i A_i / π(A_i, X_i) − [Kc]_i )² + λ c^T K c,    (2.12)

where [Kc]_i denotes the i-th element of the vector Kc. Note that KRR maps the covariates into an infinite dimensional space and is computationally efficient due to the kernel trick. However, it does not provide model selection for nonlinear function estimation. To perform model selection in nonlinear function estimation, we use the COSSO proposed by [19], based on the smoothing spline analysis of variance (SS-ANOVA) model ([32]). The SS-ANOVA model can be written as

f(X) = α + Σ_{j=1}^p f_j(x_j) + Σ_{j<k} f_{jk}(x_j, x_k) + ⋯,

where α is the intercept, X = (x_1, ⋯, x_p)^T, the f_j's are baseline (main effect) functions, the f_{jk}'s are two-way interactions, and so on. The functional space in the SS-ANOVA model can then be written as a direct sum of d orthogonal subspaces H_1, ⋯, H_d. If the model only involves baseline functions, then d = p; for a two-way interaction model, d = p(p + 1)/2. COSSO estimates f by minimizing

min_f (1/n) Σ_{i=1}^n ( R_i A_i / π(A_i, X_i) − f(X_i) )² + λ Σ_{i=1}^d ||P_i f||,    (2.13)

where P_i f is the orthogonal projection of f onto H_i and ||·|| is the pseudo-norm in the RKHS. [19] showed that (2.13) can be solved efficiently via alternating minimization.
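The KRR step has a closed form: minimizing (2.12) gives c = (K + nλI)^{-1} Y for the transformed response Y. The sketch below implements this with a Gaussian kernel; the circular boundary δ(X) and all constants are our own illustrative choices, not an example from the paper:

```python
import numpy as np

def gaussian_kernel(U, V, gamma=2.0):
    """Gaussian kernel matrix between the rows of U and V."""
    d2 = ((U[:, None, :] - V[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(3)
n = 1000
X = rng.uniform(-1, 1, (n, 2))
A = rng.choice([-1, 1], n)                          # pi = 1/2
delta = X[:, 0]**2 + X[:, 1]**2 - 0.6               # circular decision boundary
R = 1.0 + A * delta + rng.normal(0.0, 0.5, n)

Y = R * A / 0.5                                     # E[Y | X] = 2 * delta(X)
K = gaussian_kernel(X, X)
lam = 1e-3
c = np.linalg.solve(K + n * lam * np.eye(n), Y)     # minimizer of (2.12)

Xt = rng.uniform(-1, 1, (2000, 2))                  # test points
f_hat = gaussian_kernel(Xt, X) @ c
acc = (np.sign(f_hat) == np.sign(Xt[:, 0]**2 + Xt[:, 1]**2 - 0.6)).mean()
print(acc)                                          # well above chance
```

Setting the gradient of (2.12) to zero gives (2/n)K[(K + nλI)c − Y] = 0, which the linear solve above satisfies.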
As a remark, we point out that [30] used a formula similar to (2.9) to estimate interactions between the treatment and covariates with a linear working model. However, in addition to linear learning, we also propose D-learning for nonlinear decision rules, which can capture a broader class of functions; this is not equivalent to fitting a kernel regression under the formulation of [30]. More importantly, in Section 2.3 we extend the proposed D-learning to handle multiple treatments in ITR problems.

Multi-category D-learning
Most existing methods for estimating optimal ITRs focus only on binary treatment settings. In practice, however, applications with multiple treatments are common ([1]). To the best of our knowledge, little research has addressed settings with more than two treatments. In this section, we consider a K-armed treatment setting, where A ∈ A = {1, ⋯, K}. For simplicity, we focus on the class of linear decision rules to approximate the optimal ITR; the extension to the nonlinear setting can be derived similarly. The notation and assumptions remain the same as before.
Expanding equation (2.1) for K treatments, we obtain

V(d) = E[ Σ_{k=1}^K E[R | X, A = k] I(d(X) = k) ].

Similar to the binary treatment setting, the optimal ITR is given by

d*(X) = argmax_{k ∈ {1,⋯,K}} E[R | X, A = k].    (2.14)

Without changing the order of the elements, we can further write the optimal ITR as

d*(X) = argmax_{k ∈ {1,⋯,K}} f_k(X),  where  f_k(X) = Σ_{i ≠ k} f_{ki}(X)  and  f_{ki}(X) = E[R | X, A = k] − E[R | X, A = i],    (2.15)

and each pairwise contrast f_{ki}(X) can be estimated by restricting to the patients with A ∈ {k, i}: define the binary random variable A_{ki} = 1 if A = k and A_{ki} = −1 if A = i, with π_{ki}(1, x) = P[A = k | X = x, A ∈ {k, i}], so that the previously proposed binary D-learning method can be used for estimation. The function f_k(X), which we name the effect measure of treatment k, is the sum of the pairwise decision functions comparing the k-th treatment with the other treatments.
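The multi-category procedure can be sketched directly: run binary D-learning on each treatment pair, accumulate the pairwise fits into the effect measures f_k, and take the argmax as in (2.15). The treatment effects μ_k below are illustrative choices of ours:

```python
import numpy as np

rng = np.random.default_rng(4)
n, K = 6000, 3
X = rng.uniform(-1, 1, (n, 2))
A = rng.integers(1, K + 1, n)                      # uniform over {1, 2, 3}
mu = np.column_stack([X[:, 0], X[:, 1], np.zeros(n)])   # mu_k(X), k = 1..3
R = (1.0 + X[:, 0]**2) + mu[np.arange(n), A - 1] + rng.normal(0.0, 0.5, n)

Z = np.column_stack([np.ones(n), X])               # linear design
f_hat = np.zeros((n, K))                           # effect measures f_k(X_i)
for k in range(1, K + 1):
    for i in range(1, k):
        sub = (A == k) | (A == i)                  # pair (k, i); pi_ki = 1/2
        Aki = np.where(A[sub] == k, 1.0, -1.0)
        Y = R[sub] * Aki / 0.5                     # E[Y | X] = f_ki(X) here
        beta = np.linalg.lstsq(Z[sub], Y, rcond=None)[0]
        fki = Z @ beta                             # estimated f_ki everywhere
        f_hat[:, k - 1] += fki                     # f_k accumulates +f_ki
        f_hat[:, i - 1] -= fki                     # f_i accumulates f_ik = -f_ki

d_hat = f_hat.argmax(axis=1) + 1
d_opt = mu.argmax(axis=1) + 1
print((d_hat == d_opt).mean())                     # high agreement
```

By construction, each row of `f_hat` sums to zero, matching the sum-to-zero property of the effect measures noted later in this section.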
One of the key differences between our method and fitting a regression model with treatment-by-covariate interactions is that D-learning avoids modeling the main effect functions, but directly targets on decision functions. The difference is significant because of the mismatch between minimizing the prediction error and maximizing the value function in finding optimal ITRs. Model-based methods are focused on predicting responses, i.e., minimizing the prediction error. In contrast, although our proposed D-learning also has the regression form, it directly targets on maximizing the value function. This difference can be more substantial for nonlinear learning due to the potential over-fitting.
In practice, among K treatments, one may be interested in learning the relative strengths of all treatments besides the estimated ITR. For example, if two treatments have similar effect measure, the doctor and patient may decide to use the suboptimal one if it costs less or has less potential side effects. Thus, our effect measure provides additional useful information besides the ITR for precision medicine. If there exits multiple treatment options producing the same largest expected treatment effect, other possible outcomes besides the current outcome may be considered for decision making. For example, assume that we also observe Z, the side effect after a patient receives a treatment. Similar to the treatment effect measure f k (X) in our proposed D-learning, we can estimate the side effect measure of each treatment, such as g k (X), for each patient by the multi-category D-learning method. Then we can use f k (X) + τ g k (X) to decide the best treatment for each patient, where τ is used to balance these two outcomes and can be decided by doctors. In theory, if multiple treatments have the same largest expected outcome, these treatments are equivalent in terms of consistency. However, one can pick a treatment among these equivalent ones using other factors, such as costs, etc.
Our proposed multi-category D-learning first estimates f_{ki}(X) separately for 1 ≤ i < k ≤ K. The decision rule is then based on the maximum of the estimated effect measures f̂_{k(n)}(X) for k = 1, ⋯, K, according to (2.15). Here the subscript (n) indicates that the estimated function depends on the sample size n. We point out that it is sufficient to estimate the K(K−1)/2 pairwise functions with i < k, since f_{ik}(X) = −f_{ki}(X). There is a close relationship between our multi-category D-learning and the standard majority vote One-vs-One (OVO) multi-category classification method. For K-category classification, the OVO method builds K(K−1)/2 classifiers, each of which distinguishes a pair of classes i and j for i, j = 1, ⋯, K. If the corresponding classifier g_{ij} for the (i, j) pair is positive, we assign 1 to class i and −1 to class j; otherwise we reverse the assignment. The final classifier is g = argmax_i Σ_{j=1}^K sign(g_{ij}).
The key difference is that the OVO multi-category classification approach uses the majority vote to assign the class label, while multi-category D-learning is based on the effect measure, i.e., the cumulative pairwise contrast value against all other treatments. If one replaced f_k(X) by Σ_{i ≠ k} sign(f_{ki}(X)) in (2.15), the rule would become similar to majority vote OVO classification. However, by our derivation, the majority vote is not suitable for D-learning.
As a remark, from (2.15) we observe that Σ_{k=1}^K f_k(X) = 0, which implies that one of the effect measures is redundant. This is similar to the sum-to-zero constraint in simultaneous multi-category classification methods, where the constraint is used to guarantee the consistency of the classifiers ([34]).

Tuning parameter selection
For existing methods such as l1-PLS, OWL, and RWL, the tuning parameter is selected via cross-validation, mainly by maximizing the empirical value function on validation data, defined as

V̂(d) = E_n[ R I(A = d(X)) / π(A, X) ] / E_n[ I(A = d(X)) / π(A, X) ],    (2.16)

where E_n denotes the empirical average. For D-learning, since it directly estimates the decision boundary, we consider an alternative way to select the tuning parameter λ. In particular, we select the λ that minimizes the mean squared error (MSE) on the validation data. In the binary treatment setting, for example, it is defined as

E_n[ ( RA / π(A, X) − f̂(X) )² ],

where f̂ is the estimated decision boundary function. Since this is the same procedure as for standard regression techniques such as the LASSO, it can be easily implemented via standard software packages such as glmnet ([11]) in R to estimate the optimal ITR. We point out that, unlike the tuning criterion (2.16) used for previous methods, our model building and tuning steps match, in the sense that both use the least squares loss.
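The MSE-based tuning step can be sketched as follows. We use a ridge penalty rather than the lasso purely to keep the fit closed-form, and all constants and the λ grid are our own illustrative choices; the point is that λ is chosen by the same least-squares criterion used to fit the model:

```python
import numpy as np

rng = np.random.default_rng(5)

def simulate(n, p=10):
    """Generate (design, transformed response RA/pi) under a balanced trial."""
    X = rng.uniform(-1, 1, (n, p))
    A = rng.choice([-1, 1], n)
    delta = X[:, 0] - X[:, 1]
    R = 1.0 + A * delta + rng.normal(0.0, 1.0, n)
    return np.column_stack([np.ones(n), X]), R * A / 0.5

Ztr, Ytr = simulate(400)          # training split
Zva, Yva = simulate(2000)         # validation split

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
mses = []
for lam in lambdas:
    pen = lam * np.eye(Ztr.shape[1])
    pen[0, 0] = 0.0                                   # do not penalize intercept
    beta = np.linalg.solve(Ztr.T @ Ztr + pen, Ztr.T @ Ytr)
    mses.append(np.mean((Yva - Zva @ beta) ** 2))     # validation MSE of f_hat

best = lambdas[int(np.argmin(mses))]
print(best, [round(m, 2) for m in mses])              # heavy shrinkage hurts
```

In practice one would use K-fold cross-validation over the training set rather than a single held-out split, exactly as standard lasso software does.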

Theoretical properties of D-learning
In this section, we establish the estimation consistency of D-learning and obtain a value reduction bound of the estimated ITR from the optimal ITR. Fast convergence rates can be achieved under some reasonable assumptions. We consider linear and nonlinear D-learning in Sections 3.1 and 3.2, respectively.

Consistency and value reduction bounds under linear decision rules
We first state the generalized Margin condition (gMC) used in our theoretical results.
Assumption (gMC). For any ε > 0, there exist constants C > 0 and α > 0 such that

P( 0 < f_{d*(X)}(X) − max_{k ≠ d*(X)} f_k(X) ≤ ε ) ≤ C ε^α.

Theorem 1. For the estimated effect measures f̂_{k(n)}(X), k = 1, ⋯, K, and the corresponding ITR d̂_n obtained by multi-category D-learning, we have

V(d*) − V(d̂_n) ≤ C′ ( Σ_{k=1}^K E[ (f_k(X) − f̂_{k(n)}(X))² ] )^{1/2}.

Furthermore, if we assume the gMC holds, then we can improve the rate as

V(d*) − V(d̂_n) ≤ C′ ( Σ_{k=1}^K E[ (f_k(X) − f̂_{k(n)}(X))² ] )^{(1+α)/(2+α)},

where C′ depends on the constants C, α, and K.

Remark 1. Theorem 1 provides an approach to bound the reduction in the value function by the prediction errors E[ (f_{ij}(X) − f̂_{ij(n)}(X))² ] for i, j = 1, ⋯, K, so that good estimation of each pairwise contrast further improves the bound. For the special case K = 2, Theorem 1 implies

V(d*) − V(d̂_n) ≤ C′ { E[ (f_0(X) − f̂_{0(n)}(X))² ] }^{1/2},

where f_0(X) and f̂_{0(n)}(X) are the optimal decision function and the corresponding estimate as defined in the previous section. With the gMC assumption, we have

V(d*) − V(d̂_n) ≤ C′ { E[ (f_0(X) − f̂_{0(n)}(X))² ] }^{(1+α)/(2+α)},

where C′ depends on the constants C and α.
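For K = 2, the square-root bound in Remark 1 follows from a short standard argument, sketched below; the constants are generic rather than the paper's:

```latex
\begin{align*}
V(d^*) - V(\hat d_n)
  &= \mathbb{E}\!\left[\big(\mathbb{E}[R \mid X, 1] - \mathbb{E}[R \mid X, -1]\big)
     \big(I(d^*(X) = 1) - I(\hat d_n(X) = 1)\big)\right] \\
  &= \mathbb{E}\!\left[\,|f_0(X)| \, I\big(\hat d_n(X) \neq d^*(X)\big)\right]
     \qquad \text{since } d^*(X) = \operatorname{sign}(f_0(X)) \\
  &\le \mathbb{E}\!\left[\,|f_0(X) - \hat f_{0(n)}(X)| \, I\big(\hat d_n(X) \neq d^*(X)\big)\right]
     \qquad \text{a sign disagreement forces } |f_0| \le |f_0 - \hat f_{0(n)}| \\
  &\le \left\{\mathbb{E}\big[(f_0(X) - \hat f_{0(n)}(X))^2\big]\right\}^{1/2}
     \qquad \text{by Cauchy--Schwarz.}
\end{align*}
```

Under the gMC, splitting the disagreement event according to whether |f_0(X)| ≤ ε and optimizing over ε yields the improved exponent (1 + α)/(2 + α).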
Theorem 1 indicates that our D-learning, by minimizing the empirical prediction error, can estimate the optimal ITR. In particular, for linear D-learning, we consider F := {f(X) = X^T β, β ∈ R^p} as the class of linear functions to approximate f_0(X). For the remaining theoretical results, without loss of generality, we assume that π(A, X) = 1/2 for any A ∈ A and X ∈ X, and that K = 2. The resulting prediction error consists of an approximation error and an estimation error, since we do not require that the true function f_0(X) lie in F.
Before characterizing the prediction error bound, we say that a subset S ⊆ {1, 2, ⋯, p} satisfies the compatibility condition ([3]) if, for some constant φ(S) > 0 and for all β ∈ R^p with ||β_{S^c}||_1 ≤ 3||β_S||_1, we have

||β_S||_1² ≤ ( |S| / φ²(S) ) β^T Σ̂ β,

where |S| is the cardinality of S, S^c is the complement of the set S, and the j-th element of β_S is β_{S,j} = β_j I(j ∈ S). In addition, we use X_n to denote the design matrix with n samples, so that Σ̂ = (1/n) X_n^T X_n. Furthermore, we let Ω be the collection of index sets S satisfying the compatibility condition. We now provide the finite sample bound for the prediction error.
Before establishing the error bound for the proposed D-learning, we state assumptions (i)-(vii); among them are moment and tail conditions on the covariates and errors, required to hold for any vector γ, and conditions on the tuning parameter λ in (2.9). We define the oracle β* as the coefficient vector, with support in some S ∈ Ω, that minimizes the approximation error to f_0. Then we have the following theorem.

Theorem 2. Assume assumptions (i)-(vii) hold, and for t > 0 let the tuning parameter λ be of order t(log p / n)^{1/2}. Then, for a constant C1 depending on the constants a, ρ, and σ², and for t sufficiently large, with probability at least 1 − C1/t² the prediction error bound (3.10) holds, for a constant C2 determined by the gMC constant, t, φ*, and s*.

Remark 3. The inequality (3.11) gives the value reduction bound between the optimal ITR and the estimated ITR. If the dimension p does not grow exponentially faster than the sample size n, the bound vanishes as n → ∞. If we fix the dimension p, the convergence rate is at least of order n^{−1/2}, corresponding to α = 0. If the data are completely separable, then by Remark 1 the fast convergence rate of order 1/n can be achieved. Our results are consistent with those derived by [24].

Value reduction bounds under nonlinear decision rules
In this section, we establish the value reduction bounds for the nonlinear Model (2.11). For Model (2.13), we refer to [19] for the finite sample error bounds under the fixed design setting. We assume the outcome R is bounded, i.e., |R| ≤ C0 almost surely for some constant C0, and recall that we let π(a, x) = 1/2 as discussed in Section 3.1.
Before proceeding to our results, we need the following definitions.

Definition 1. Consider F to be a class of real-valued measurable functions f : Z → R. The Rademacher complexity of F is defined as

R_n(F) = E[ sup_{f ∈ F} | (1/n) Σ_{i=1}^n σ_i f(Z_i) | ],

where Z_1, ⋯, Z_n are i.i.d. copies of Z and σ_1, ⋯, σ_n are i.i.d. Rademacher variables, taking the values ±1 with probability 1/2 each, independent of the Z_i's.

The corresponding empirical Rademacher complexity of F is defined as

R̂_n(F) = E_σ[ sup_{f ∈ F} | (1/n) Σ_{i=1}^n σ_i f(Z_i) | given Z_1, ⋯, Z_n ],

so that R_n(F) = E[ R̂_n(F) ]. According to Theorem 1, we need to establish bounds for the prediction error under Model (2.11) in order to obtain the value reduction bound for our estimator. Note that the resulting bound holds with probability at least 1 − ε, for some constant c1 that is independent of n and decreasing in ε.
Remark 5. This theorem shows that the excess risk converges to 0 in probability under some conditions. The upper bound assumption on the approximation error A(λ n ) is standard in the statistical learning literature such as [29] to derive the convergence rate. Replacing this upper bound assumption by the assumption that X belongs to a compact metric space, convergence of the excess risk to 0 can still be established by the universal consistency property of the Gaussian kernel ( [28]).
for some constant C3, which is determined by the gMC constant and c1 in Theorem 4.

Simulation study
In this section, we conduct extensive simulation studies to investigate the finite sample performance of D-learning. The outcome is generated following Model (2.7) with W ∼ N(0, σ²), where σ is set to 1 or 4. Each covariate is independently generated from a uniform distribution on [−1, 1]. The randomized treatment A follows a uniform distribution over the treatments. The number of replications is 100. We evaluate D-learning under both linear and nonlinear settings in Sections 4.1 and 4.2 for binary treatments, and in Section 4.3 we consider the multi-category setting.

Linear decision boundary study
We consider sample sizes n = 100, 400 and dimensions p from 30 to 1920, increasing by a factor of 4. We compare D-learning with the following three methods: (1) l1-PLS by [24] with basis functions (1, X, A, XA); (2) OWL by [37] with a linear kernel; (3) RWL by [39] with a linear kernel.
Four linear boundary scenarios are considered. The first scenario was considered in [39]. The second scenario corresponds to the situation where x3 interacts with the treatment but does not affect the treatment selection strategy. The third scenario represents the situation where prescriptive variables have weak power to predict the outcome but are vital for decision making, in terms of the effect size (ES) defined in [4]. We can check that the effect size in the third scenario is small (less than 0.2), so it is hard to estimate the optimal ITR correctly, especially for the l1-PLS method. The fourth scenario uses a nonlinear main effect m(X) to evaluate the robustness of D-learning. In all four scenarios, the optimal ITRs depend on the first two covariates. We use 10-fold cross-validation to select the tuning parameter based on the empirical value function over the validation dataset. We evaluate all the methods based on two criteria. The first criterion is the misclassification rate of the estimated ITR relative to the optimal ITR on independently generated test data. The second criterion is the empirical value function of the estimated ITR on the test data via (2.16). Specifically, we generate 10,000 independent test observations to assess performance under these two criteria. Figure 1 shows the misclassification rates for all four scenarios when n = 100 and σ = 1. Additional simulation results are provided in the appendix. Based on the results, we conclude that in all situations D-learning has competitively low misclassification error rates compared with the other three methods, especially when the noise is large. Specifically, for the first linear scenario, D-learning has performance comparable to l1-PLS but better than OWL and RWL in low dimensions. When the dimension gets higher, D-learning performs the best among all methods.
One potential reason for the competitive performance of our proposed D-learning is its effective prescriptive variable selection in high dimensional settings. For the second scenario, our method performs better than l1-PLS because it can effectively identify prescriptive variables that have qualitative interactions with the treatment. For the third scenario, l1-PLS performs worse than D-learning and RWL due to the mismatch between prediction and ITR estimation in l1-PLS; the interaction effect term in this example has little predictive power. For the last scenario, since the main effect is nonlinear, RWL performs worse than D-learning and l1-PLS because of the improper residual calculation under model misspecification, whereas our proposed D-learning does not need to construct weights at all. Figure 2 reports the empirical value functions of the estimated ITRs on the test data for all four scenarios; D-learning performs the best in most scenarios. Finally, we compare D-learning with l1-PLS in selecting the true prescriptive variables using the average False Positive (FP) and False Negative (FN) counts. Table 1 shows that D-learning has comparably small average FNs but much smaller average FPs than l1-PLS for both low and high dimensional problems. We conclude that D-learning better identifies the true variables of the optimal ITR in these examples.

Nonlinear decision boundary study
In this subsection, we evaluate D-learning when the true decision boundary is nonlinear. We consider sample sizes n = 100, 400 and covariate dimensions p = 5, 50. We compare our methods, including linear D-learning, Gaussian kernel KRR D-learning, and Gaussian kernel COSSO D-learning, with the following three alternatives: (1) l1-PLS with basis functions (1, X, A, XA); (2) RWL with a linear kernel; (3) RWL with a Gaussian kernel.
We consider two nonlinear boundary scenarios, both considered in [39]. In the first scenario the decision boundary is a parabola, while in the second it is a circle. From Tables 2 and 3, we can see that, in general, the nonlinear D-learning methods perform better than the other three methods, especially when the sample size is large. In addition, linear D-learning performs competitively even when the decision function δ(X) is misspecified. In practice, the choice of kernel function can be viewed as part of tuning parameter selection: one can fit various kernel functions such as linear, polynomial, and Gaussian kernels, and then select the best one using cross-validation. The optimal value is calculated via E_d[R], since the optimal decision rule is known in simulation studies.
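Treating the kernel as a tuning parameter can be sketched as follows. With constant propensity π = 1/2, kernel D-learning reduces to (kernel) ridge regression of the transformed outcome 2RA on X; the toy data, grid values, and the use of regression fit as the selection score are illustrative assumptions (the paper selects by the empirical value function instead).

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)

# Toy randomized-trial data with a circular decision boundary and pi = 1/2.
n, p = 400, 5
X = rng.normal(size=(n, p))
A = rng.choice([-1, 1], size=n)
R = 1.0 + A * (1.0 - X[:, 0] ** 2 - X[:, 1] ** 2) + rng.normal(scale=0.5, size=n)

# Regress 2RA on X; the kernel (and its parameters) are selected by
# cross-validation along with the ridge penalty.
Z = 2.0 * R * A
search = GridSearchCV(
    KernelRidge(),
    param_grid={"kernel": ["linear", "rbf"],
                "alpha": [0.1, 1.0],
                "gamma": [None, 0.1, 0.5]},
    cv=5,
)
search.fit(X, Z)

# The estimated ITR is the sign of the fitted decision function.
d_hat = np.sign(search.predict(X))
```

On nonlinear boundaries such as this one, cross-validation will typically prefer the Gaussian (rbf) kernel, mirroring the pattern in Tables 2 and 3.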

Multi-category linear decision boundary study
In this subsection, we evaluate our proposed multi-category D-learning when the true decision boundary is linear. For comparison, we extend the OWL and RWL methods to the multi-category scenario using the one-versus-one (OVO) multi-category classification scheme. We consider two versions: one uses the majority-vote OVO multi-category classification method, and the other uses a procedure similar to the effect measure used for D-learning, as discussed in Section 2.3. We call the first version hard OVO and the second soft OVO. We consider sample sizes n = 800, 1600, with covariate dimensions the same as in the binary linear decision boundary study. Here we set σ to 1; simulation results with larger σ are in the appendix. We compare our multi-category D-learning with these methods under two linear boundary scenarios, which differ only in the functional form of the main effect.

Figure 3 shows the misclassification rates for these two scenarios with n = 1600 and σ = 1. Since ties often occur for hard OVO-OWL and OVO-RWL, in our simulation study the hard versions perform significantly worse than the soft versions; we include the comparison between hard and soft methods and additional simulation results in the appendix. Based on the results, we conclude that for both scenarios multi-category D-learning has lower error rates than the other three methods. Figure 4 shows the empirical value functions of the estimated ITRs on the test data for these two scenarios; our proposed multi-category D-learning has the largest value functions in most settings. Finally, we compare the computational time of D-learning with the other three methods in Table 4. Due to its simplicity, our proposed multi-category D-learning has the lowest computational cost in most settings.
As the dimension gets large, our proposed multi-category D-learning is even faster than l1-PLS, since l1-PLS needs to fit a LASSO regression with Kp + 2 variables while multi-category D-learning only needs to fit K(K − 1)/2 separate LASSO regression problems, each with p + 1 variables.
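The pairwise structure can be sketched as follows: for each treatment pair, restrict to the subjects receiving those two treatments, recode them as ±1, and fit a lasso of 2Ra on X. The data-generating model, lasso penalty, and majority-vote prediction rule below are illustrative assumptions (the paper also discusses a soft, effect-measure-based aggregation).

```python
import itertools
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)

# Toy data with K = 3 treatments assigned uniformly (pi = 1/3 each).
n, p, K = 600, 10, 3
X = rng.normal(size=(n, p))
A = rng.integers(1, K + 1, size=n)
effect = np.stack([X[:, 0], X[:, 1], -X[:, 0] - X[:, 1]], axis=1)
R = 1.0 + effect[np.arange(n), A - 1] + rng.normal(scale=0.5, size=n)

# One-versus-one D-learning: K(K-1)/2 separate lasso fits,
# each with p + 1 coefficients (including the intercept).
fits = {}
for j, k in itertools.combinations(range(1, K + 1), 2):
    idx = np.isin(A, [j, k])
    a = np.where(A[idx] == j, 1.0, -1.0)
    fits[(j, k)] = Lasso(alpha=0.05).fit(X[idx], 2.0 * R[idx] * a)

def predict_itr(x):
    """Majority vote over pairwise comparison functions (the hard OVO rule)."""
    votes = np.zeros(K)
    for (j, k), model in fits.items():
        winner = j if model.predict(x.reshape(1, -1))[0] > 0 else k
        votes[winner - 1] += 1
    return int(np.argmax(votes)) + 1
```

Each pairwise fit only sees roughly 2n/K subjects and p + 1 variables, which is why this scales better than a single regression in Kp + 2 variables.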

Applications to AIDS clinical data
In this section, we apply D-learning with the linear kernel to data from AIDS Clinical Trials Group (ACTG) 175 ([13]). This trial was designed to evaluate whether monotherapy for HIV infection is inferior to combination therapy, based on patients' CD4+ T cell counts. In this trial, 2139 patients with HIV infection were randomly assigned with equal probability to four treatment groups: zidovudine (ZDV) monotherapy, ZDV + didanosine (ddI), ZDV + zalcitabine (Zal), and ddI monotherapy. We choose the difference between the early-stage CD4+ T cell count (cells/mm^3) and the baseline CD4+ T cell count prior to the trial as the clinical outcome; in general, the larger this change, the better the patient's condition. In addition to the treatment, we include 12 clinical covariates in our model, as done in [21] and [9]. Five of these covariates are continuous: weight (kg), baseline CD4+ T cell count, baseline CD8 count (cells/mm^3), age (years), and Karnofsky score (0-100 scale). The remaining seven covariates are binary: gender (0 = female, 1 = male), race (0 = white, 1 = non-white), homosexual activity (0 = no, 1 = yes), history of intravenous drug use (0 = no, 1 = yes), symptomatic status (0 = asymptomatic, 1 = symptomatic), antiretroviral history (0 = naive, 1 = experienced), and hemophilia (0 = no, 1 = yes). For interpretability of the decision rule, we use linear D-learning to estimate the ITR.
For scenario 1, the only non-zero coefficient of the ITR estimated by D-learning is the intercept, which is negative, implying that the other three treatments are better than ZDV.
For scenario 2, linear D-learning identifies three important prescriptive variables: age, homosexual activity, and baseline CD4+ T cell count. The estimated ITR is sign(41.38 + 14.03 × age − 13.45 × baseline CD4+ cell − 9.55 × homo), which is similar to the result in [21] but identifies one more variable: baseline CD4+ T cell count. For young patients with homosexual activity and high baseline CD4+ T cell counts, the estimated ITR assigns treatment ZDV + Zal; for older patients without homosexual activity and with low baseline CD4+ T cell counts, it assigns treatment ZDV + ddI.
For scenario 3, D-learning identifies four important prescriptive variables: age, homosexual activity, baseline CD4+ T cell count, and history of intravenous drug use. The estimated ITR is sign(37.19 + 7.94 × age − 25.20 × baseline CD4+ cell − 25.19 × homo + 33.10 × drug), which is similar to the results in [21] and [9] but identifies one more variable: history of intravenous drug use. For young patients with homosexual activity and high baseline CD4+ T cell counts but no history of intravenous drug use, the estimated ITR assigns treatment ddI; for older patients without homosexual activity, with low baseline CD4+ T cell counts but a history of intravenous drug use, it assigns treatment ZDV + ddI.
For scenario 4, D-learning identifies one important prescriptive variable: history of intravenous drug use. The estimated ITR is sign(−10.43 + 9.45 × drug). However, this ITR is always −1, which means treatment ddI is always preferable to treatment ZDV + Zal; the two treatments have similar efficacy for patients with a history of intravenous drug use.
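As a toy illustration of how such a linear rule is applied, the scenario 2 coefficients quoted above can be wrapped in a sign function. The continuous inputs here are hypothetical standardized values, since the covariate scaling used in the fitted model is not reproduced in this text.

```python
import numpy as np

# Scenario 2 rule from the text:
#   sign(41.38 + 14.03*age - 13.45*baseline_cd4 - 9.55*homo).
# NOTE: age and baseline_cd4 below are hypothetical standardized values,
# not raw clinical measurements.
def itr_scenario2(age, baseline_cd4, homo):
    score = 41.38 + 14.03 * age - 13.45 * baseline_cd4 - 9.55 * homo
    return int(np.sign(score))

# Hypothetical older patient with low baseline CD4+ and no homosexual
# activity; per the text, the positive sign corresponds to ZDV + ddI
# in this pairing.
recommendation = itr_scenario2(age=1.0, baseline_cd4=-1.0, homo=0)
```

The same pattern applies to the scenario 3 and 4 rules, with their respective coefficients.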
To compare D-learning with l1-PLS, linear OWL, and linear RWL, we report the empirical value functions in Table 5. We split the data into three folds, using two as training data and the remaining one as test data. The procedure is repeated 1200 times.
From Table 5, we can see that D-learning has the highest empirical value function in the first two scenarios. In the last two scenarios, while D-learning is not the highest, it still performs well (second place). Overall, D-learning performs better than the other three methods in finding the optimal ITR.

Overall comparison
So far, we have considered several binary problems to find ITRs for each pair of treatments separately. Compared with several alternative methods, the results show that D-learning yields better decision boundaries for choosing between two treatments so as to maximize the outcome. Our second goal is to estimate the optimal ITR considering all four treatments simultaneously. We first compare our proposed multi-category D-learning with l1-PLS, linear OVO-OWL, and linear OVO-RWL based on the empirical value function. Following the same procedure as before, we randomly split the data into three folds, using two as training data and the remaining one as test data, and repeat the procedure 1200 times. The results are shown in Table 6: multi-category D-learning has a significant advantage over the other three methods in terms of the value function.

For further investigation, we report the coefficient estimates of the linear comparison functions f_k, k = 1, ..., 4, in Table 7. Besides the important prescriptive variables identified in Section 5.1, multi-category D-learning identifies three more variables: Karnofsky score (0-100 scale), gender, and race, which play a moderate role in the decision rule, since the absolute values of their coefficients are not large. The signs of the multi-category D-learning coefficient estimates for the prescriptive variables identified in Section 5.1 are consistent with the previous study. According to the previous study by [13], treatment with ZDV + ddI, ZDV + Zal, or ddI alone slows the progression of HIV disease and is superior to treatment with ZDV alone, which is consistent with our findings. As found in [14], Zal has the most serious adverse events compared with the other two treatments, so it is generally not recommended. Our finding also supports this suggestion, since the absolute values of the coefficient estimates of f_3(X) are relatively small compared with the others. In addition, according to the VIDEX (didanosine) label released by the Food and Drug Administration (FDA) (http://www.accessdata.fda.gov/drugsatfda_docs/label/2006/020154s50,20155s39,20156s40,21183s16lbl.pdf), ddI has a significant interaction with allopurinol, which is sold as an IV drug and as tablets, so one should avoid using them together. The large negative coefficient estimate for the IV drug variable in f_4(X) by D-learning also supports this recommendation.

Discussion
In this article, we propose a direct learning (D-learning) method to estimate the optimal ITR by reformulating the optimal decision function. D-learning is very simple and flexible with highly competitive performance.
The goal of D-learning is to estimate the optimal ITR by combining the advantages of model-based methods such as l1-PLS and classification-based methods such as RWL and OWL. Because of its direct formulation, D-learning is robust to potential model misspecification and can unify the goals of prediction and identification of the optimal ITR. Moreover, D-learning can be applied to multiple treatment settings. Based on the simulation and real data studies, the proposed D-learning methods indeed achieve better performance than several existing methods.
In this paper, we consider randomized clinical studies where the propensity score π(A, X) is assumed to be known. In observational studies it can be estimated by statistical models such as multinomial logistic regression, but such models may themselves be misspecified. This limitation may be addressed by the results in [35]: using the augmented inverse probability weighting strategy, their methods enjoy double robustness for observational studies, in that consistency holds if either the postulated regression model or the propensity score model is correct. Another possible approach to reduce the risk of misspecification in propensity score estimation is to use nonparametric methods, such as regression trees, to estimate π(A, X).
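The two propensity estimators mentioned above can be sketched side by side. The data-generating mechanism and model settings below are illustrative assumptions for a three-arm observational setting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)

# Observational-style toy data: treatment assignment depends on covariates.
n, p = 500, 4
X = rng.normal(size=(n, p))
logits = np.stack([X[:, 0], -X[:, 0] + X[:, 1], np.zeros(n)], axis=1)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
A = np.array([rng.choice(3, p=pr) for pr in probs])

# Parametric estimate: multinomial logistic regression for pi(a | x).
logit = LogisticRegression(max_iter=1000).fit(X, A)
pi_logit = logit.predict_proba(X)

# Nonparametric alternative: a shallow classification tree, less sensitive
# to misspecification of the functional form of pi(a | x).
tree = DecisionTreeClassifier(max_depth=3).fit(X, A)
pi_tree = tree.predict_proba(X)
```

Either set of estimated probabilities could then replace the known π(A, X) in the D-learning objective, at the cost of the misspecification risk discussed above.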
Several possible extensions can be explored in future studies. The current framework of D-learning focuses on estimating the single-stage optimal ITR; extensions to multiple-stage ITR problems can be obtained. Several methods have been developed for dynamic ITRs, such as [22], [38], and [20]; we refer to [17] for a review. It would be interesting to compare the performance of multi-stage D-learning with existing methods. Another possible extension is to develop D-learning for other types of outcomes, such as binary data and survival time.

More simulation results
For D-learning in the linear boundary case, we consider additional settings where σ is 1 or 4 and the sample size is 400. Figure 5 shows misclassification rates for these settings; the results are similar to the case of n = 100 and σ = 1. When σ becomes large, our proposed D-learning has lower error rates than the other three methods. The corresponding value functions, shown in Figure 6, are consistent with the misclassification results. For multi-category treatment settings, we present simulation results for the remaining settings, namely (n, σ) = (800, 1), (800, 4), and (1600, 4). Our proposed multi-category D-learning has the lowest error rates and the highest empirical value functions among all methods, as shown in Figures 7 and 8 respectively. In addition, according to the results in Figures 9 and 10, we conclude that soft OVO-OWL and OVO-RWL perform better than hard OVO-OWL and OVO-RWL.

Proof of Lemma 1
Proof. Assume that the differential operator and the expectation can be exchanged, and let g(f) = E[(1/π(A, X))(2RA − f(X))^2]. Setting ∂g(f_0)/∂f_0 = 0 and using the convexity of g, the result follows.
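A sketch of this computation, assuming binary A ∈ {−1, +1} and writing μ_a(X) = E[R | X, A = a] (notation introduced here for illustration):

```latex
\begin{aligned}
g(f) &= \mathbb{E}\!\left[\frac{1}{\pi(A,X)}\bigl(2RA - f(X)\bigr)^2\right] \\
&= \mathbb{E}_X\!\left[\sum_{a=\pm 1}\pi(a,X)\cdot\frac{1}{\pi(a,X)}\,
      \mathbb{E}\bigl[(2Ra - f(X))^2 \mid X, A=a\bigr]\right] \\
&= \mathbb{E}_X\Bigl[\mathbb{E}\bigl[(2R - f(X))^2 \mid X, A=1\bigr]
      + \mathbb{E}\bigl[(2R + f(X))^2 \mid X, A=-1\bigr]\Bigr],
\end{aligned}
```

so the inverse-propensity weights cancel. Differentiating pointwise in f(X) and setting the derivative to zero,

```latex
2\bigl(f_0(X) - 2\mu_1(X)\bigr) + 2\bigl(f_0(X) + 2\mu_{-1}(X)\bigr) = 0
\;\Longrightarrow\;
f_0(X) = \mu_1(X) - \mu_{-1}(X),
```

and by convexity f_0 is the global minimizer, so sign(f_0(X)) recovers the optimal ITR.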

Proof of Theorem 1
Proof. For any decision rule d, we start from the corresponding pointwise inequality. Repeating this argument K times with different indices i, we obtain a bound on the value difference, where the Hölder and Minkowski inequalities are used in the last step. If Condition (3.1) is assumed as well, optimizing the resulting upper bound yields the bound in (7.1). When K = 2, the result is similar.

Proof of Theorem 2
Lemma 3 (Nemirovski moment inequality). For m ≥ 1 and p ≥ e^{m−1}, the following inequality holds. Next we prove Theorem 2.

Proof of Theorem 3
The results follow directly by combining Theorems 1 and 2.

Proof of Theorem 2
Proof. We first decompose the excess risk into estimation error and approximation error, where the first two terms (I) and (II) are the estimation error; the last inequality follows from the definition of f̂. In order to bound the estimation error, we first obtain bounds on ||f̂||_H and ||f_{λ_n}||_H. By the definition of f̂ in Model (2.11), we have L_n(0, 0) = (1/n) Σ_{i=1}^n R_i^2 ≤ M_1, since R is bounded by C_0 by assumption. Since L_n ≥ 0, we have (λ_n/2) ||f̂||_H^2 ≤ M_1 and |L_n(f̂)| ≤ M_1, or equivalently D(f̂(X_i)) = (R_i − f̂(X_i))^2 ≤ M_2, since |f̂(X_i)| ≤ ||f̂||_H sup_{X ∈ 𝒳} K(X, X)^{1/2}. By a similar argument, we also obtain (λ_n/2) ||f_{λ_n}||_H^2 ≤ M_1 and |D(f_{λ_n})| ≤ M_2. Define the corresponding functional class Ξ and let P_n be the empirical measure on Z^n. We first derive bounds for the estimation errors (I) and (II). For term (I), note that (I) ≤ sup_Ξ [P D(f) − P_n D(f)], where P is the probability measure of (X, A, R). When any (X_i, A_i, R_i) changes, by the definition of Ξ, sup_Ξ [P D(f) − P_n D(f)] changes by no more than M_2/n. Then by McDiarmid's inequality, with probability at least 1 − ε/2 we obtain the corresponding bound. Note that D(f) is Lipschitz with constant M_3 over Ξ. By Corollary 3.17 in [18], we have R_n(Ξ) ≤ M_3 R_n(Π). (7.11) Thus, combining these together, the stated bound holds with probability 1 − ε.

Proof of Theorem 4
Proof. Based on the assumptions and the definition of Π, we can bound R_n(Π) by Lemma 3. According to Theorem 2 and the assumptions on the approximation error, the stated bound holds with probability 1 − ε. Optimizing the right-hand side with respect to λ_n, we can take λ_n = O(n^{−1/(2ω+1)}) and obtain the final result: with probability at least 1 − ε, the bound holds for some constant c_1 decreasing in ε.

Proof of Corollary 4.1
Proof. The results follow directly by combining Theorems 1 and 4.