Semiparametric single-index model for estimating optimal individualized treatment strategy

Abstract: Different from the standard treatment discovery framework, which seeks a single best treatment for a homogeneous group of patients, personalized medicine involves finding therapies that are tailored to each individual in a heterogeneous group. In this paper, we propose a new semiparametric additive single-index model for estimating an individualized treatment strategy. The model assumes a flexible and nonparametric link function for the interaction between treatment and predictive covariates. We estimate the rule via monotone B-splines and establish the asymptotic properties of the estimators. Both simulations and a real data application demonstrate that the proposed method has competitive performance.


Introduction
In modern clinical research, the goal of achieving better outcomes together with lower cost and burden for individual patients has generated tremendous interest in personalized medicine. Individualized treatment rules (ITRs) operationalize personalized medicine as a decision function from a patient's individual biomarkers to a recommended treatment, and the optimal ITR is the one that maximizes the clinical benefit when implemented. Specifically, let A denote the treatment assignment taking values -1 and 1, let X denote all biomarker and prognostic information associated with each patient, and let Y be the clinical outcome of interest (assuming large values are desirable). An individualized treatment rule (ITR), denoted by d(x), takes a given value x of X and provides a treatment choice from {-1, 1}. Furthermore, let P^d denote the distribution of (X, A, Y) when the rule d(x) is used to assign treatments, and let E^d denote expectation with respect to this distribution. Define the value function as V(d) = E^d(Y). Then an optimal ITR, d_0, is a rule with maximal value, i.e., d_0 maximizes V(d) over all decision rules d.
There has been growing interest in developing valid inference methods for estimating the optimal ITR, d_0, using clinical trial data. With trial data, it holds that V(d) = E[Y I(A = d(X))/π(A|X)] [15], where π(a|X) is the known randomization probability of A = a given X, so it is easy to see that d_0(x) = sign{E[Y|A = 1, X = x] - E[Y|A = -1, X = x]}, where the sign(·) function is defined as sign(x) = 1 for x > 0 and sign(x) = -1 for x < 0. Therefore, most existing methods model E[Y|A = a, X = x], including the interactions between the treatment and the covariates, either parametrically or nonparametrically. Such literature includes likelihood-based approaches [19,18,20], parametric Q-learning [1], and machine-learning-based methods [25]. Alternatively, one can parametrically model only the treatment-covariate interaction, an approach called A-learning, as discussed in [14] and [16]. Recently, direct maximization of V(d) has been proposed using support vector machines in [26] or via robust parametric models in Zhang et al. [24]. However, all parametric methods potentially suffer from model misspecification, especially when X is not low-dimensional and the optimal ITR depends on high-order interactions among the components of X. On the other hand, although nonparametric methods such as machine-learning methods are flexible, the resulting rules are complicated and so may not be interpretable in practice; they also often come without the rigorous inference procedures available for parametric methods.
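As an illustration of the value identity above, V(d) = E[Y I(A = d(X))/π(A|X)] suggests a simple plug-in estimator. The sketch below is a minimal Python version (function and argument names are our own, not from the paper, whose implementation is in MATLAB), assuming a balanced two-arm trial with π = 0.5.

```python
import numpy as np

def ipw_value(y, a, x, rule, prop=0.5):
    """IPW estimate of the value V(d) = E[Y I(A = d(X)) / pi(A|X)].

    y    : (n,) outcomes (larger is better)
    a    : (n,) assigned treatments in {-1, +1}
    x    : (n, p) covariates
    rule : function mapping the covariate matrix to recommendations in {-1, +1}
    prop : known randomization probability pi(a|x) (0.5 in a balanced trial)
    """
    y = np.asarray(y, dtype=float)
    a = np.asarray(a)
    agree = (a == rule(np.asarray(x))).astype(float)  # I(A_i = d(X_i))
    return float(np.mean(y * agree / prop))
```

A quick sanity check: for a rule that always agrees with the assigned treatment, the estimate reduces to mean(y)/π.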
In this paper, we propose a semiparametric single-index model to estimate the optimal ITR. Our model retains a flexible and nonparametric formulation of the treatment-covariate interactions but also yields a simple decision rule that depends only on a linear combination of X. Specifically, our proposal assumes the following model between Y and (X, A):

Y = μ(X) + A ψ(βᵀX) + ε,    (1)

where X is a p-dimensional covariate vector and may contain 1 as the intercept, βᵀX is a single index, ε is a mean-zero error, and both μ and ψ are unknown functions. Moreover, ψ is a monotone increasing function with ψ(0) = 0. The proposed model has the following advantages for developing an individualized treatment strategy. First, it provides a more flexible interaction between the covariates and the treatment than traditional parametric models: we allow a fully nonparametric baseline function μ(X) of the covariates and a close-to-nonparametric interaction between the treatment A and the covariates X. Second, we can easily derive the best treatment strategy as d_0 : x → sign(ψ(βᵀx)).
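To make the model concrete, here is a minimal Python sketch of a data-generating process of the form Y = μ(X) + A ψ(βᵀX) + ε. The particular choices of μ, ψ and the covariate distribution are illustrative assumptions, not the paper's simulation settings.

```python
import numpy as np

def simulate_single_index(n, beta, mu, psi, sigma=0.5, seed=None):
    """Draw (Y, A, X) from Y = mu(X) + A * psi(beta^T X) + eps.

    beta is rescaled to unit norm (the model's identifiability constraint);
    A is +/-1 with probability 1/2 each; eps ~ N(0, sigma^2).
    mu (baseline) and psi (increasing link with psi(0) = 0) are user-supplied.
    """
    rng = np.random.default_rng(seed)
    beta = np.asarray(beta, dtype=float)
    beta = beta / np.linalg.norm(beta)
    x = rng.normal(size=(n, beta.size))
    a = rng.choice([-1, 1], size=n)
    y = mu(x) + a * psi(x @ beta) + rng.normal(scale=sigma, size=n)
    return y, a, x

def optimal_rule(x, beta):
    """d_0(x) = sign(psi(beta^T x)), which equals sign(beta^T x) when psi
    is increasing with psi(0) = 0."""
    beta = np.asarray(beta, dtype=float)
    return np.sign(np.asarray(x) @ (beta / np.linalg.norm(beta)))
```

Note that the baseline μ(X) cancels out of the treatment decision: the optimal rule depends on X only through the index βᵀX.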
Since ψ is increasing, the resulting rule is practically interpretable. Moreover, if ψ(0) = 0, the treatment strategy d_0 simplifies to the simple rule d_0(x) = sign(βᵀx). That is, only the sign of a risk score βᵀX needs to be evaluated for each patient. As a separate note, single-index models have been studied extensively in the literature, with a number of inference methods developed, including the average derivative method [5], sliced inverse regression [12,3,11], the iterative average derivative method [6] and other related methods [23]. Estimating the single index and the link function simultaneously has also been studied in [9,8,4]. However, none of these works considered the single-index model for estimating the optimal ITR; in particular, our model (1) allows the main effect of X, μ(X), to be fully nonparametric. The rest of the paper is organized as follows. In Section 2, we provide a full inference procedure for the proposed semiparametric single-index model. Extensive simulation studies are presented in Section 3 and a real data analysis in Section 4, followed by a discussion section.

Inference procedure
Note that model (1) remains unchanged if we replace (β, ψ(·)) by (β/r, ψ(r·)) for any r > 0. Therefore, for identifiability, we further require ‖β‖ = 1, where ‖·‖ is the Euclidean ℓ_2-norm in R^p. Assume that the data are i.i.d. observations (Y_i, X_i, A_i), i = 1, ..., n, obtained from a randomized trial. The randomization probability P(A = a|X) = π(a|X) is known by the trial design.
To avoid estimating the nonparametric function μ(X) when making inference for β, we first observe that E[AY/{2π(A|X)} | X] = ψ(β₀ᵀX). Therefore, a natural estimate of β is obtained by minimizing the least-squares criterion

Σ_{i=1}^n { A_iY_i/(2π(A_i|X_i)) − ψ(βᵀX_i) }²,

subject to ‖β‖ = 1. Since ψ is an increasing function, we approximate ψ(x) using monotone B-splines [2,10]:

ψ(x) ≈ Σ_{j=1}^{K_n+M} ξ_j N_j(x),  with ξ_1 ≤ ··· ≤ ξ_{K_n+M},

where N_1(x), ..., N_{K_n+M}(x) are B-spline basis functions, K_n is the number of interior knots, which equally partition an interval containing βᵀX, and M is the B-spline order (for a cubic B-spline, M = 4). The condition ξ_1 ≤ ··· ≤ ξ_{K_n+M} ensures monotonicity of ψ(·) [10]. Additionally, for theoretical considerations we impose an upper bound M_n on the sum of the absolute values of all the B-spline coefficients of ψ(·); M_n is a constant depending on n, and its rate is given in Section 3. Thus, the minimization becomes a constrained problem over (β, ξ). Set d = K_n + M. The objective function is quadratic in ξ and quite nonlinear in β. The constraint ‖β‖ = 1 is nonlinear in the elements of β. The monotonicity constraint is linear in ξ, since it can be expressed as Bξ ≤ 0, where ξ = (ξ_1, ..., ξ_d)ᵀ and B is a (d−1)×d matrix with B(i, i) = 1, B(i, i+1) = −1 and all other entries zero. To facilitate the implementation, we propose an iterative estimation algorithm: we solve for ξ with β fixed at its current value, then solve for β with ξ fixed at its current value, and repeat until the convergence criterion is met. The computation procedure can be summarized as follows.

Step 1: Obtain an initial estimate β^(0) of β with ‖β^(0)‖ = 1, and set ℓ = 0.
Step 2: Given the current index values {Z_i = β^(ℓ)ᵀX_i, i = 1, ..., n}, minimize over ξ by solving the following quadratic programming (QP) problem:

min_ξ Σ_{i=1}^n { A_iY_i/(2π(A_i|X_i)) − Σ_{j=1}^d ξ_j N_j(Z_i) }²  subject to Bξ ≤ 0 and Σ_{j=1}^d |ξ_j| ≤ M_n.

Denote the solution by ξ^(ℓ).
Step 3: With ξ fixed at ξ^(ℓ), minimize

Σ_{i=1}^n { A_iY_i/(2π(A_i|X_i)) − Σ_{j=1}^d ξ_j^(ℓ) N_j(βᵀX_i) }²  subject to ‖β‖ = 1,

and denote the solution by β^(ℓ+1). This problem can be solved using a nonlinear least squares (NLS) algorithm.
Step 4: Repeat Steps 2 and 3 until ‖β^(ℓ+1) − β^(ℓ)‖ < ε for a small ε > 0, which takes the value 1e-3 in our numerical studies.
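The monotonicity constraint and the QP subproblem of Step 2 can be sketched as follows. This Python version is a simplified stand-in: it uses a general-purpose SLSQP solver instead of a dedicated QP routine (the paper uses MATLAB's quadprog), and assumes the B-spline basis has already been evaluated at the current index values into a design matrix.

```python
import numpy as np
from scipy.optimize import minimize

def monotone_constraint_matrix(d):
    """(d-1) x d matrix B with B[i, i] = 1, B[i, i+1] = -1, so that
    B @ xi <= 0 encodes the nondecreasing coefficients xi_1 <= ... <= xi_d
    (hence a nondecreasing spline)."""
    b = np.zeros((d - 1, d))
    idx = np.arange(d - 1)
    b[idx, idx] = 1.0
    b[idx, idx + 1] = -1.0
    return b

def solve_qp_step(design, pseudo, bound=None):
    """Minimize ||design @ xi - pseudo||^2 subject to xi nondecreasing and,
    optionally, sum |xi_j| <= bound (the M_n constraint). 'design' holds the
    B-spline basis evaluated at the current index values Z_i, and 'pseudo'
    the pseudo-outcomes A_i Y_i / (2 pi(A_i|X_i))."""
    d = design.shape[1]
    cons = [{"type": "ineq", "fun": lambda xi, j=j: xi[j + 1] - xi[j]}
            for j in range(d - 1)]
    if bound is not None:
        cons.append({"type": "ineq",
                     "fun": lambda xi: bound - np.abs(xi).sum()})
    res = minimize(lambda xi: np.sum((design @ xi - pseudo) ** 2),
                   x0=np.zeros(d), constraints=cons, method="SLSQP")
    return res.x
```

With an identity design matrix, the step reduces to isotonic regression, which gives a convenient way to check the solver.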
In our numerical examples, we use MATLAB's optimization toolbox: the function quadprog() for the QP in Step 2 and lsqnonlin() for the NLS in Step 3. We use cubic B-splines throughout the numerical studies and the real data application. Our algorithm usually converges in fewer than 10 iterations.
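In the same spirit, the β-update of Step 3 can be sketched in Python with SciPy's least_squares standing in for lsqnonlin. As a simplification of the constrained NLS described above, the unit-norm constraint is handled here by renormalizing the unconstrained solution; pseudo_i denotes A_iY_i/(2π(A_i|X_i)).

```python
import numpy as np
from scipy.optimize import least_squares

def update_beta(pseudo, x, psi, beta_init):
    """Step-3 sketch: with the current spline psi frozen, minimize
    sum_i (pseudo_i - psi(beta^T x_i))^2 over beta by nonlinear least
    squares, then renormalize so that ||beta|| = 1."""
    x = np.asarray(x, dtype=float)
    pseudo = np.asarray(pseudo, dtype=float)
    resid = lambda b: pseudo - psi(x @ b)          # residual vector r(beta)
    beta = least_squares(resid, np.asarray(beta_init, dtype=float)).x
    return beta / np.linalg.norm(beta)
```

In a noiseless check where pseudo_i = ψ(β₀ᵀx_i) exactly, the update should recover β₀.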
Given K_n, we place the interior knots at equally spaced sample quantiles of the predictor variable, which here is βᵀX. For example, with 4 interior knots, they are placed at the 20th, 40th, 60th and 80th percentiles, respectively. The boundary knots are naturally chosen as the minimum and maximum values of the predictor variable. Since the estimated single index β can change at each step of the iteration, the knots change across iterations as well. The number of knots K_n can be tuned by cross-validation; in general, 5 to 10 knots are sufficient to obtain very good results.
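This knot-placement scheme is a one-liner in practice; a minimal Python helper (our own naming):

```python
import numpy as np

def place_knots(index_values, k):
    """Interior knots at k equally spaced sample quantiles of the current
    index values beta^T X_i, with boundary knots at the min and max."""
    probs = np.linspace(0.0, 1.0, k + 2)[1:-1]   # k=4 -> 0.2, 0.4, 0.6, 0.8
    interior = np.quantile(np.asarray(index_values, dtype=float), probs)
    lo, hi = float(np.min(index_values)), float(np.max(index_values))
    return interior, (lo, hi)
```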

Asymptotic results
We establish the asymptotic properties of the estimators (β̂_n, ψ̂_n), including their consistency under a certain metric, their convergence rates, and the asymptotic distribution of √n(β̂_n − β_0). We need the following conditions.
(C.1) β_0 is assumed to be in the unit ball B of R^p, and X has a compact support.
Under these conditions, we first obtain the consistency and convergence rate of (β̂_n, ψ̂_n).

Theorem 1. Under (C.1)-(C.3), we further assume K_n = C_1 n^γ and M_n = C_2 n^τ for some positive constants C_1, C_2 with γ > 0, τ ≥ 0, and 11γ + 9τ suitably bounded. Here W^{s,∞} is the Sobolev space consisting of functions with bounded l-th derivatives for all l ≤ s, with the Sobolev norm defined accordingly.

The asymptotic distribution of β̂_n is stated in the following theorem.

Theorem 2. In addition to (C.1)-(C.3), we assume K_n = C_1 n^γ and M_n = C_2 n^τ for some positive constants C_1 and C_2.

Based on Theorem 2, a consistent estimator for the asymptotic covariance is given by the sandwich form Σ̂_1^{-1} Σ̂_2 Σ̂_1^{-1}, where Σ̂_1 and Σ̂_2 are plug-in sample analogues of Σ_1 and Σ_2.

R. Song et al.
By Theorem 1, it is clear that Σ̂_1 and Σ̂_2 are consistent estimators of Σ_1 and Σ_2, respectively, as the sample size tends to infinity. Finally, we estimate the optimal decision rule as sign(β̂_nᵀX). Under this rule, for any subject, the estimated average reward gain of using the optimal rule versus the non-optimal rule is 2 P_n|ψ̂_n(β̂_nᵀX)|.
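The estimated gain 2 P_n|ψ̂_n(β̂_nᵀX)| is straightforward to compute once β̂_n and ψ̂_n are available; a small Python sketch (the fitted spline is passed in as a callable, an assumption of this illustration):

```python
import numpy as np

def estimated_reward_gain(x, beta_hat, psi_hat):
    """Estimated average gain of the optimal rule sign(psi(beta^T X)) over
    the opposite assignment: 2 * P_n |psi_hat(beta_hat^T X)|."""
    index = np.asarray(x, dtype=float) @ np.asarray(beta_hat, dtype=float)
    return 2.0 * float(np.mean(np.abs(psi_hat(index))))
```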

Numerical studies
In this section, we conduct extensive simulations to investigate the empirical performance of the proposed method. We first use three examples (Examples I-III) to compare our method with the inverse probability weighted estimator (IPWE) and the augmented inverse probability weighted estimator (AIPWE) of [24], as well as an ordinary least squares (OLS) approach. Finally, in Example IV, we investigate the performance of our method under model misspecification (i.e., when ψ(·) is not monotone). In all examples, A is generated as −1 or 1 with equal probability 0.5 and the noise follows a normal distribution with mean 0 and standard deviation σ = 0.5. To evaluate the estimation of the single-index coefficient, we report its bias and the mean squared error MSE(β), the average over replications of ‖β̂ − β_0‖²/p. To evaluate the estimation of the link function, we report its mean squared error MSE(ψ), the average over replications of the mean squared difference between ψ̂_n and ψ_0 over the observed index values. To evaluate the accuracy of a treatment assignment rule sign(β̂ᵀX), we calculate the percentage of correct decisions (PCD), i.e., the empirical frequency with which sign(β̂ᵀX) agrees with sign(β_0ᵀX).
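When ψ is increasing with ψ(0) = 0, the PCD compares sign(β̂ᵀX) with sign(β₀ᵀX); a minimal Python sketch over an evaluation sample:

```python
import numpy as np

def pcd(x, beta_hat, beta_true):
    """Percentage of correct decisions: the fraction of subjects whose
    recommended treatment sign(beta_hat^T X) matches the optimal
    sign(beta_true^T X)."""
    x = np.asarray(x, dtype=float)
    return float(np.mean(np.sign(x @ np.asarray(beta_hat, dtype=float))
                         == np.sign(x @ np.asarray(beta_true, dtype=float))))
```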
We also study the behavior of the value function estimates. Based on the estimated rule ĝ, the value function can be estimated as P_n[Y I(A = ĝ(X))/π(A|X)]. We compare the proposed method with [24] in terms of parameter estimates, percentage of correct decisions (PCD) and value function estimates.
From Tables 1-3, we observe that our method shows better results than the inverse probability weighted estimator (IPWE) and the augmented inverse probability weighted estimator (AIPWE) of [24], with smaller bias for the estimated single-index coefficient and smaller mean squared error for the estimated link function. In most cases, the bias of the estimated single-index coefficient from our approach is about ten times smaller than that of the other two approaches. As a result, our method also makes more correct decisions and gives an estimated value function much closer to its theoretical value. We also note that, as the sample size increases, the mean squared errors of the single-index coefficient and the estimated link function decrease for all three methods, the PCD increases, and the estimated value function gets closer to the true value function. However, Table 2 indicates that the ordinary least squares method performs comparably with our method and gives a larger PCD than all the other methods when ψ(0) ≠ 0; this is simply because, when ψ > 0, the optimal treatment is the same for every subject and the sign of the OLS-fitted index agrees with it for most subjects. Table 4 indicates that all methods deteriorate under model misspecification; however, our method remains better than IPWE, AIPWE and the ordinary least squares method. We also investigate the proposed inferential procedure for the single-index coefficient β. Table 5 shows that, as the sample size increases, the empirical standard error and the mean estimated standard error get closer to each other. For almost all cases, the empirical coverage rates are very close to the nominal level, as expected.

Data application
To further illustrate the performance of our method, we consider its application to data from AIDS Clinical Trials Group Protocol 175 (ACTG175). The complete data contain 2139 HIV-infected subjects randomized to four different treatment groups: zidovudine (ZDV) monotherapy, ZDV + didanosine (ddI), ZDV + zalcitabine, and ddI monotherapy. The CD4 count (cells/mm³) at 20 ± 5 weeks post-baseline is chosen as the continuous response Y, with large values desired. Among all subjects, 524 received ZDV + didanosine (ddI) and 522 received ZDV + zalcitabine. For illustration purposes, we consider these two groups of patients, with the goal of finding their individualized optimal treatment rules. We use A = 1 to denote ZDV + zalcitabine and A = −1 to denote ZDV + didanosine (ddI). Besides the treatment indicator, we also include two covariates: age and homosexual activity (homo, for short), which were selected as important covariates in [13].
We apply the proposed method to estimate the optimal treatment rule and perform statistical inference for the corresponding parameters. The estimates of the single-index coefficients are 0.902, −0.036 and 0.430, respectively, and the estimated variances of the single-index coefficients are 0.2232, 0.0004 and 0.0984, respectively. The optimal treatment rule is sign(0.902 − 0.036 × age + 0.430 × homo). That is, if 0.902 − 0.036 × age + 0.430 × homo ≥ 0, the optimal treatment for the patient is ZDV + zalcitabine; otherwise, the optimal treatment is ZDV + didanosine (ddI). In other words, for a patient with homo = 0, the optimal treatment is A = −1 if age > 25.2 and A = 1 otherwise, while for a patient with homo = 1, the optimal treatment is A = −1 if age > 37.2 and A = 1 otherwise. We note that the age of study subjects ranges
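The fitted rule can be applied directly. Below is a small Python sketch hard-coding the reported coefficients (intercept 0.902, −0.036 for age, 0.430 for homo); the function name is ours, and A = 1 stands for ZDV + zalcitabine, A = −1 for ZDV + ddI, as in the text.

```python
import numpy as np

def actg175_rule(age, homo):
    """Estimated optimal rule sign(0.902 - 0.036*age + 0.430*homo):
    +1 -> ZDV + zalcitabine, -1 -> ZDV + didanosine (ddI)."""
    score = (0.902 - 0.036 * np.asarray(age, dtype=float)
             + 0.430 * np.asarray(homo, dtype=float))
    return np.where(score >= 0, 1, -1)
```

The age cut-offs quoted in the text (about 25 for homo = 0 and about 37 for homo = 1) correspond to the zero-crossings of this score.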

Discussion
In this paper, we proposed a novel semiparametric single-index model for individualized treatment selection. Our model serves as a compromise between parametric and nonparametric models [24]. The decision rule based on our method is a simple linear combination of covariates, and we provide statistical inference for this rule. The asymptotic properties of the proposed method are established. The proposed method demonstrates superior numerical behavior in terms of smaller bias and mean squared error. Based on the estimated rule, our method also makes more accurate decisions than existing methods and gives more precise value function estimates. In many clinical studies, the state space is often of very high dimension; to develop optimal individualized treatment rules in this case, it will be important to perform simultaneous variable selection and treatment rule estimation. Variable selection techniques such as penalized regression and variable screening can be nested into our semiparametric single-index modeling framework as powerful tools for developing optimal individualized treatment rules.
In our current procedure, we assume the propensity score π(A|X) is known. In observational studies, the propensity scores are often unknown. For such observational data, we can estimate π(A|X) via logistic regression and plug the estimated propensity score function π̂(A|X) into the optimization problem. This is beyond the scope of the current work and is an interesting topic for future study.

Table 5: Inference for the single-index parameters of Examples I-III. std1: empirical standard deviation; std2: mean estimated standard deviation; cover: empirical coverage rate of 95% confidence intervals. (Entries report bias, std1, std2 and cover for n = 500, 1000 and 1500.)

Appendix
Denote Z = (A, X, Y) and θ = (β, ψ). Let P_n denote the empirical measure based on Z_1 = (A_1, X_1, Y_1), ..., Z_n = (A_n, X_n, Y_n) and let P denote its expectation. Define the loss ℓ(Z; θ) = {AY/(2π(A|X)) − ψ(βᵀX)}². Then θ̂_n = (β̂_n, ψ̂_n) minimizes P_n ℓ(Z; θ) over Θ_n = B × Ψ_n, where Ψ_n is the class of monotone B-spline functions with coefficients ξ_1 ≤ ··· ≤ ξ_d and Σ_j |ξ_j| ≤ M_n. Finally, let [a, b] be a finite interval containing all βᵀx for β ∈ B and x in the support of X.

Proof of Theorem 1
Since β̂_n is bounded, by choosing a subsequence we may assume that β̂_n converges almost surely to a limit β*. Clearly ‖β*‖ = 1. Take ψ_n to be the projection of ψ_0 onto Ψ_n. According to [17], it satisfies ‖ψ_n − ψ_0‖_{W^{1,∞}} → 0; recall that W^{1,∞} is the Sobolev space of functions whose derivatives are essentially bounded. Since θ̂_n = (β̂_n, ψ̂_n) minimizes P_n ℓ(Z; θ) over Θ_n = B × Ψ_n, we have P_n ℓ(Z; β̂_n, ψ̂_n) ≤ P_n ℓ(Z; β_0, ψ_n), and we further obtain the basic inequality (3), where G_n denotes the empirical process √n(P_n − P). We then consider the class of functions H_n = {ψ(βᵀX) : (β, ψ) ∈ Θ_n}. Since β lies in the unit ball B of R^p, we can construct an ε-net β_1, β_2, ..., β_K for B with K = O(1/ε^p), such that for any β ∈ B there is an s with |βᵀX − β_sᵀX| ≤ ε. Furthermore, for any (β, ψ) ∈ Θ_n we have |ψ′(βᵀX)| ≤ O(M_n K_n), so the ε-bracket covering number of H_n is of order exp{O(M_n K_n/ε)}/ε^p (Corollary 2.7.2 of van der Vaart and Wellner [22]). Consequently, the class F_n = {ℓ(Z; β, ψ) − ℓ(Z; β_0, ψ_n) : (β, ψ) ∈ Θ_n} has a bracket covering number of the same order. Note that the L_2(P)-norm of the envelope function of F_n is bounded. According to Lemma 19.38 of [21], the left-hand side of (3) is then bounded by O(K_n M_n^5/n). Thus we obtain inequalities (4)-(6).
Note that, by the continuity of E[ψ_0(β_0ᵀX)|βᵀX] and the dominated convergence theorem, we immediately obtain the limiting equation. Differentiating both sides with respect to x and evaluating at a point x_0 in the support satisfying ψ_0′(β_0ᵀx_0) > 0, we conclude that β* is proportional to β_0, and therefore β* = β_0. Reusing inequality (6), together with the mean value theorem and the positivity condition in (C.1)-(C.3), we obtain a bound on ‖β̂_n − β_0‖. Reusing (4) and recalling |ψ̂_n′| ≤ O(K_n M_n) gives a corresponding bound for the spline part. Finally, since the L_2-norm between two functions in Ψ_n is bounded from below, up to a constant, by the Euclidean norm of the corresponding coefficient vectors ([2], p. 155), the Cauchy-Schwarz inequality yields a bound that is o_p(1) by the choice of K_n and M_n. The consistency of ψ̂_n follows. Furthermore, the derivative of ψ̂_n is bounded, again by the choice of K_n and M_n. We now improve the convergence rate of θ̂_n = (β̂_n, ψ̂_n). Since the consistency of θ̂_n has been established, similarly to the convergence-rate proof in [7] we may restrict ψ̂_n to the class {ψ ∈ Ψ_n : ‖ψ‖_{W^{1,∞}} ≤ c}, where c is a large positive constant. Let us first re-examine equation (3). For its left-hand side: since {βᵀX : β ∈ B} is a VC class, the relevant class of functions is VC-major by Lemma 2.6.19 of [22], and by Theorem 2.6.9 of [22] it has a uniform entropy bound of the form c_1 log(c_2/ε), where both c_1 and c_2 are constants. This gives that the left-hand side of (3) is O_p(n^{−1/2}), while the right-hand side of (3) is bounded from above accordingly. From equation (5), by the consistency of β̂_n and the positivity condition, we thus obtain the improved convergence rate (11) for β̂_n. Combining this with (9), and using that ψ̂_n′ is now bounded, we obtain (12). We can further improve the rate in (12).
To see this, note from (11) and (12) that, for fixed ν ∈ (0, 1/2), the relevant function on the left-hand side of (3) converges to zero in L_2(P)-norm and, with probability close to one, belongs to the class {n^{1/2−ν}(ℓ(Z; β, ψ) − ℓ(Z; β_0, ψ_n))} for a large constant M. This class satisfies the conditions of Theorem 2.11.22 of [22]; thus the left-hand side of (3) equals o_p(n^{−1+ν}). Consequently, we can improve inequalities (11) and (12), and correspondingly inequality (8). This immediately gives the stated rates, and Theorem 1 then holds by setting 0 < ν < min(1 − 3γ, 1/2).
Thus, we obtain the score equation (16), whose left-hand side is √n(P_n − P) applied to Xψ̂_n′(β̂_nᵀX) times the residual. For the left-hand side of (16), note that this function belongs to a P-Donsker class, because {βᵀX : β ∈ B} is a VC class and both ψ̂_n and ψ̂_n′ are Lipschitz continuous; moreover, it converges in L_2(P)-norm to its limit. Thus the left-hand side of (16) is asymptotically equivalent to the corresponding limiting empirical process term. We further expand the right-hand side of (16) to obtain √n P[Xψ̂_n′(β̂_nᵀX){ψ̂_n′(β_0ᵀX)Xᵀ(β̂_n − β_0) + (ψ̂_n(β_0ᵀX) − ψ_0(β_0ᵀX))}] + O(√n‖β̂_n − β_0‖²). We then replace Xψ̂_n′(β̂_nᵀX) by Xψ_0′(β_0ᵀX); due to the boundedness of ψ̂_n′, the right-hand side of (16) remains asymptotically equivalent after this replacement. From the convergence rates of (β̂_n, ψ̂_n, ψ̂_n′) in (13) and (15), and by the choice of K_n, we finally conclude that the right-hand side of (16) is equal to √n P[ψ_0′(β_0ᵀX)² XXᵀ](β̂_n − β_0) + o_p(1).