High-Dimensional Inference for Personalized Treatment Decision

Recent developments in statistical methodology for personalized treatment decisions have utilized high-dimensional regression to account for a large number of patient covariates and have described personalized treatment decisions through interactions between treatment and covariates. While a subset of interaction terms can be obtained by existing variable selection methods to indicate relevant covariates for making treatment decisions, the results often lack statistical interpretation. This paper proposes an asymptotically unbiased estimator, based on the Lasso solution, for the interaction coefficients. We derive the limiting distribution of the estimator when the baseline function of the regression model is unknown and possibly misspecified. Confidence intervals and p-values are derived to infer the effects of the patient covariates on treatment decisions. We confirm the accuracy of the proposed method and its robustness against a misspecified baseline function in simulations, and apply the method to the STAR*D study of major depressive disorder.


Introduction
Precision medicine aims to design an optimal treatment for each individual according to their specific conditions, in the hope of lowering medical costs and improving treatment efficacy. Existing methods can be broadly partitioned into regression-based and classification-based approaches. Popular regression-based approaches include Q-learning [21,11,3,6,7,17] and A-learning [13,10,8,16,15]. Q-learning models the conditional mean of the outcome given covariates and treatment, while A-learning directly models the interaction between treatment and covariates, which is sufficient for treatment decisions. Other regression-based methods have been developed in [18], [12], [19], etc. In contrast, classification-based approaches, also known as policy search or value search, estimate the marginal mean of the outcome for every treatment regime within a pre-specified class and then take the maximizer as the estimated optimal regime; see, for example, [13], [22], [23], [25], [26]. As emerging technology makes it possible to gather an extraordinarily large number of prognostic factors for each individual, such as genetic information and clinical measurements, regression-based and classification-based methods have been extended to handle high-dimensional data [12,25,26,19].

This paper focuses on regression-based approaches and addresses the current limitation that the selected interaction effects often lack statistical interpretation when the number of covariates exceeds the sample size. Consider the following semi-parametric model. Denote X as the n × p design matrix, where n is the number of patients and p is the number of patient covariates. Let X̃ be the design matrix with an additional column of 1's and X̃_i the i-th row of X̃. Denote Y_i as the outcome and A_i ∈ {0,1} the treatment assignment of the i-th patient.
Assume

Y_i = μ(X_i) + A_i β_0^T X̃_i + ε_i,   (1)

where μ(X_i) is an unspecified baseline function and β_0 is the vector of unknown coefficients for the interactions between treatment and patients' covariates. We are interested in inference on β_0 and optimal treatment decisions based on the estimated β_0.
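As a concrete illustration, the data-generating mechanism of this model can be sketched in a few lines. The choices of baseline function, coefficients, n, and p below are arbitrary stand-ins for illustration, not the paper's settings.

```python
import numpy as np

# Sketch of the model Y_i = mu(X_i) + A_i * beta0^T Xt_i + eps_i, where Xt_i
# appends an intercept to X_i. mu, beta0, n, p are illustrative choices only.
rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
Xt = np.hstack([np.ones((n, 1)), X])       # design matrix with intercept column
beta0 = np.zeros(p + 1)
beta0[1:3] = 1.0                           # sparse interaction coefficients
A = rng.binomial(1, 0.5, size=n)           # randomized treatment assignment
eps = rng.normal(size=n)
Y = np.sin(X[:, 0]) + A * (Xt @ beta0) + eps   # nonlinear, unspecified baseline
print(Y.shape)
```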
We propose an asymptotically unbiased estimator for β 0 and derive its limiting distribution when p > n. Consequently, confidence intervals and p-values of the interaction coefficients can be calculated. Because the baseline function is unknown and possibly misspecified, we investigate the robustness of the estimator to misspecified μ(X i ).
The proposed estimator for β_0 and the p-values calculated from data can be utilized to make significance-based optimal treatment decisions. We illustrate the efficiency of the new treatment regime in a real application to the STAR*D study of major depressive disorder.

An A-learning framework
We adopt the robust regression approach of [16] and transform the interaction from A_i β_0^T X̃_i to (A_i − π(X_i)) β_0^T X̃_i, where π(X_i) = P(A_i = 1 | X_i) is the propensity score. Because E(A_i − π(X_i) | X_i) = 0, the transformed interaction is orthogonal to the baseline function μ(X_i) given X_i. By doing so, we protect the estimation of β_0 from the effect of baseline function misspecification [8,16]. For simplicity of presentation, we consider a completely randomized study with prespecified propensity score, i.e., A_i ∈ {0,1} and π(X_i) = P(A_i = 1 | X_i) = π. Let μ_π(X_i) = μ(X_i) + π β_0^T X̃_i, so that the model can be rewritten as Y_i = μ_π(X_i) + (A_i − π) β_0^T X̃_i + ε_i. The true μ_π(·) is unknown. Let μ̂_π(·) be an estimator of μ_π(·) that does not necessarily converge to the true μ_π(·). Assume μ̂_π(·) → μ_π*(·) uniformly for some function μ_π*(·).
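The orthogonality that motivates this transformation can be checked numerically: since E(A_i − π | X_i) = 0, the sample covariance between the transformed interaction and any baseline function of X vanishes as n grows. A small sketch, with an arbitrary baseline and illustrative coefficients:

```python
import numpy as np

# With pi = P(A=1|X) known, (A - pi) has conditional mean zero, so the transformed
# interaction (A - pi) * beta0^T Xt is orthogonal to any baseline function of X.
rng = np.random.default_rng(1)
n = 200_000
X = rng.normal(size=(n, 2))
pi = 0.5
A = rng.binomial(1, pi, size=n)            # randomized, independent of X
baseline = np.exp(X[:, 0])                 # arbitrary baseline function mu(X)
interaction = (A - pi) * (1.0 + X[:, 1])   # (A - pi) * beta0^T Xt, beta0 = (1, 0, 1)
print(abs(np.mean(baseline * interaction)) < 0.05)   # sample covariance is near 0
```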

Unbiased estimation with misspecified baseline function
We propose a de-sparsified estimator for β_0 motivated by [24] and [20]:

b = β̂ + Θ̂ (D(A, π) X̃)^T (Y − μ̂_π(X) − D(A, π) X̃ β̂) / n,   (4)

where β̂ is the Lasso solution, D(A, π) = diag(A_1 − π, …, A_n − π), and Θ̂ is an estimator of the precision matrix of D(A, π)X̃.
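A numerical sketch of this de-biasing step follows. The Lasso fit, the baseline estimate, and the precision-matrix estimate are crude stand-ins (zeros, a constant, and a ridge-regularized inverse, purely for illustration); the paper obtains Θ̂ by nodewise regression.

```python
import numpy as np

# One-step de-biasing: correct a sparse estimate by projecting working residuals
# through a precision-matrix estimate. All fitted quantities are stand-ins.
rng = np.random.default_rng(2)
n, p = 100, 20
X = rng.normal(size=(n, p))
Xt = np.hstack([np.ones((n, 1)), X])          # design with intercept column
beta0 = np.zeros(p + 1)
beta0[1] = 1.5
pi = 0.5
A = rng.binomial(1, pi, size=n)
Y = X[:, 0] + A * (Xt @ beta0) + rng.normal(size=n)

Z = (A - pi)[:, None] * Xt                    # rows of D(A, pi) Xt
mu_hat = np.full(n, Y.mean())                 # stand-in baseline estimate (constant)
beta_lasso = np.zeros(p + 1)                  # stand-in for the Lasso solution
Theta_hat = np.linalg.inv(Z.T @ Z / n + 0.1 * np.eye(p + 1))  # stand-in for Theta

# De-sparsified estimator: beta_lasso plus a residual correction term.
b = beta_lasso + Theta_hat @ Z.T @ (Y - mu_hat - Z @ beta_lasso) / n
print(b.shape)
```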
We show that this estimator is asymptotically unbiased and derive its limiting distribution as follows. Decompose b − β_0 into three terms, denoted η, Δ_1, and Δ_2. Consequently, for an arbitrary q × (p + 1) matrix H, we can decompose √n H(b − β_0) into three terms. The reason to study √n H(b − β_0) is to explore the asymptotic joint distribution of q arbitrary linear contrasts of √n(b_1 − β_{0,1}), …, √n(b_p − β_{0,p}), which includes the limiting distribution of a single √n(b_j − β_{0,j}) as a special case. We derive the limiting distribution of √n H(b − β_0) under the following conditions.

Condition 2: There exists a μ_π*(·) such that, for any X_i ∈ R^{p+1}, parts (a) and (b) hold.
Condition 3: Denote X_j as the jth column of X̃; ‖X_j‖_∞ ≤ C for all 1 ≤ j ≤ p + 1.

Next, we show that √n HΔ_1 and √n HΔ_2 are of order o_p(1). Similar to [20], we apply the lasso for nodewise regression to construct Θ̂; details of the construction are presented in Section A.1, and additional assumptions are given in Conditions 4-7. The remaining component of √n H(b − β_0) is √n Hη. We show that √n Hη converges to a multivariate normal distribution. Since η does not involve β̂ − β_0, fewer conditions are needed in the following lemma.

Lemma. Consider model (1). Assume Conditions 1-3 and 6-7. Obtain Θ̂ by nodewise regression with λ_j ≍ √(log(p)/n) for any j ∈ {1,…,p}. Then √n Hη converges to a multivariate normal distribution.
We summarize the above results in the following theorem.

Some examples for μ̂_π
In real applications, we need to choose an estimator μ̂_π for the baseline function. The only requirement on μ̂_π is Condition 2, which is very general. Here we discuss two examples.

Example 1: μ̂_π(X_i) = constant. In this case, Condition 2 (a) holds trivially.

Condition 2 (b) is implied by
The following corollary presents the limiting distribution of √n H(b − β_0) in this case. The proof of this corollary is straightforward and, thus, omitted.
Example 2: μ̂_π(X_i) = γ̂^T X̃_i, where γ̂ and β̂ are obtained jointly from the penalized regression in (9). Note that the solution β̂ from (9) is equivalent to the solution from (3) (the detail of the proof is included in the proof of Corollary 2.6). In this case, we replace Condition 2 with Condition 9.

Corollary 2.6. Consider model (1). Implement μ̂_π(X_i) = γ̂^T X̃_i and obtain γ̂ and β̂ from (9) with λ_{n,p} ≍ √(log(p)/n). Assume Conditions 1, 3-7, and 9. Then the conclusion of Theorem 2.4 holds.

Interval estimation and p-value
The theoretical result on the asymptotic normality of the de-sparsified estimator b can be utilized to construct confidence intervals for the coefficients of interest and to perform hypothesis testing. The variance-covariance matrix G can be approximated by H Θ̂ V̂ Θ̂^T H^T. Therefore, a pointwise (1 − α) confidence interval of β_{0,j} can be constructed as

b_j ± z_{1−α/2} √(Ĝ_{jj} / n).   (10)

One can also calculate the asymptotic p-value for testing H_0: β_{0,j} = 0 for a given j ∈ {1,…,p}. A non-zero β_{0,j} means that covariate X_j is relevant in making treatment decisions for a patient.
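Concretely, given a coordinate of the de-sparsified estimator and its estimated standard error, the interval and the two-sided p-value follow from standard normal quantiles. A minimal sketch (the numeric inputs are illustrative, not from the paper):

```python
from statistics import NormalDist

# Confidence interval and two-sided p-value for H0: beta_0j = 0, based on the
# asymptotic normality of b_j. se_j stands for the estimated sqrt(G_jj / n).
def ci_and_pvalue(b_j, se_j, alpha=0.05):
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha / 2)            # 1.96 for alpha = 0.05
    ci = (b_j - z * se_j, b_j + z * se_j)
    pval = 2 * (1 - nd.cdf(abs(b_j) / se_j))
    return ci, pval

ci, p = ci_and_pvalue(b_j=0.8, se_j=0.3)
print(round(ci[0], 3), round(ci[1], 3), round(p, 4))
```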

Simulation
We consider three simulation settings with different baseline functions (linear in Case 1 and nonlinear in Cases 2 and 3). We simulate the random error from N(0,1) and the design matrix X from the multivariate normal MN(0,Σ), where Σ = AR(0.5), i.e., Σ_{jk} = 0.5^{|j−k|}. The parameter in the baseline functions is set as γ = (1,1,0,···,0)^T, and the parameter of interest β_0 has s nonzero elements with values 1 or 1.5. The treatment main effect is set at zero.
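The design described above can be generated as follows; this is a sketch of the covariate and coefficient setup only (one case, one intensity), not the full simulation.

```python
import numpy as np

# Rows of X drawn from MN(0, Sigma) with the AR(0.5) structure
# Sigma_jk = 0.5^{|j-k|}, N(0,1) errors, and a sparse beta_0.
rng = np.random.default_rng(3)
n, p, s = 200, 300, 5
idx = np.arange(p)
Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])   # AR(0.5) covariance matrix
L = np.linalg.cholesky(Sigma)
X = rng.normal(size=(n, p)) @ L.T                    # rows are MN(0, Sigma)
beta0 = np.zeros(p + 1)
beta0[1:s + 1] = 1.0                                 # s relevant interaction effects
eps = rng.normal(size=n)                             # N(0,1) random error
print(X.shape, int(np.count_nonzero(beta0)))
```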
We apply our de-sparsified estimator b in (4) with μ̂_π(X_i) = γ̂^T X̃_i and derive confidence intervals by (10). To evaluate the finite-sample performance of our method, we report the following measures. Denote CI_j as the 95% confidence interval for β_{0,j} and S_true = {j ∈ {2, …, p + 1}: β_{0,j} ≠ 0}.
• MAB(b): the mean absolute bias of b for relevant variables, calculated as s^{−1} ∑_{j ∈ S_true} |b_j − β_{0,j}|.
• MAB(β̂): the mean absolute bias of the Lasso solution for relevant variables, calculated as s^{−1} ∑_{j ∈ S_true} |β̂_j − β_{0,j}|.
• Coverage for noise variables: the empirical value of (p − s)^{−1} ∑_{j ∈ S_true^c} I(β_{0,j} ∈ CI_j).
• Coverage for relevant variables: the empirical value of s^{−1} ∑_{j ∈ S_true} I(β_{0,j} ∈ CI_j).
• Length of CI for noise variables: the empirical value of (p − s)^{−1} ∑_{j ∈ S_true^c} length(CI_j).
• Length of CI for relevant variables: the empirical value of s^{−1} ∑_{j ∈ S_true} length(CI_j).

Tables 1-2 summarize the performance measures for Cases 1-3 from 100 simulations with n = 200, p = 300, and s = 5 or 10. The mean absolute bias for relevant variables of our estimator b is significantly smaller than that of the Lasso solution in all settings. The coverages of the confidence intervals are fairly consistent across Cases 1-3, concurring with the theoretical results on the robustness of the method in Corollary 2.6. We notice that the coverage of the confidence intervals for noise variables is very close to the nominal level of 0.95, but that for the relevant variables is lower than 0.95. Similar phenomena have been observed for the original de-sparsified estimator in [20] and [4]. This indicates that even though the bias of the original Lasso estimator has been corrected asymptotically by the de-sparsifying procedure, the non-zero coefficients can still be under-estimated in finite samples. Further, given the implemented μ̂_π(X_i) = γ̂^T X̃_i, it agrees with our expectation that the lengths of the confidence intervals are shortest in Case 1, when the true baseline function is linear, and increase in Cases 2 and 3, with nonlinear baseline functions.
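The coverage and length measures above reduce to simple averages over the relevant and noise coordinates; a minimal sketch with toy inputs (not the paper's simulation output):

```python
import numpy as np

# Empirical coverage and average CI length, split by relevant vs. noise variables.
def coverage_and_length(lower, upper, beta0, relevant):
    inside = (lower <= beta0) & (beta0 <= upper)   # indicator beta_0j in CI_j
    length = upper - lower
    noise = ~relevant
    return (float(inside[relevant].mean()),        # coverage, relevant variables
            float(inside[noise].mean()),           # coverage, noise variables
            float(length[relevant].mean()),        # mean CI length, relevant
            float(length[noise].mean()))           # mean CI length, noise

beta0 = np.array([1.0, 0.0, 1.5, 0.0])
lower = np.array([0.6, -0.3, 0.9, -0.2])
upper = np.array([1.4, 0.3, 1.4, 0.2])             # the third CI misses beta0 = 1.5
out = coverage_and_length(lower, upper, beta0, beta0 != 0)
print(out)
```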

Application to STAR*D study
We consider the dataset from the multi-site, randomized, multi-step STAR*D (Sequenced Treatment Alternatives to Relieve Depression) study for major depressive disorder (MDD), a common chronic and recurrent disorder [5,14]. In the STAR*D study, patients received citalopram (CIT), a selective serotonin reuptake inhibitor antidepressant, at Level 1. Patients who had an unsatisfactory outcome at Level 1 were included in Level 2 to randomly receive one of two treatment switch options: sertraline (SER) and bupropion (BUP). Patients who received treatment at Level 2 but did not show sufficient improvement were randomized at Level 2A.
Our data contain 319 patients who received treatment switch options at Level 2. Among them, 48% of the patients received BUP and 52% received SER. In addition to the treatment indicator, 308 covariates of the patients are included. The outcome of interest is the Quick Inventory of Depressive Symptomatology - Self-Report (QIDS-SR16). Similar to existing studies on STAR*D [15], we transform the outcome to the negative of the original outcome, so that larger values are better. We are interested in making optimal treatment decisions between SER and BUP to maximize the mean outcome.
We apply the de-sparsified estimator b in (4) with μ̂_π(X_i) = γ̂^T X̃_i and derive the p-values based on the limiting distribution in Corollary 2.6. The top-ranked p-values that are less than 0.05 are presented in Table 3 with the corresponding covariates.
The meanings of the notations in Table 3 are as follows. Qccur_r_rate: QIDS-C score changing rate. URNONE: no symptoms in the patient's urination category. NVTRM: tremors. IMPWR: indicating whether patients thought they had special powers. hWL: HRS weight loss. DSMTD: recurrent thoughts of death, recurrent suicidal ideation, or a suicide attempt. hMNIN: HRS middle insomnia. EMSTU: did you worry a lot that you might do something to make people think that you were stupid or foolish?
We have also derived the 95% confidence intervals by (10) for the interaction effects between all covariates and treatment options. The confidence intervals for the top 14 interaction effects are presented in Figure 1.
We also apply cross validation to evaluate the effects of the selected variables in making optimal treatment decisions. Specifically, we randomly split the data in half into a training dataset and a testing dataset. On the training dataset, we use the 14 selected variables to compute the least squares estimator of the interaction coefficients and formulate the estimated optimal treatment regime. On the corresponding testing dataset, we use the estimated value function derived from the inverse probability weighted estimator [23] to measure the performance of the estimated optimal treatment regime. For comparison, we also compute the estimated value functions for the regimes of assigning only SER and only BUP. We conducted 1000 data splits and present the boxplot of the estimated value functions of all the treatment regimes in Figure 2. Our estimated optimal treatment regime generally gives larger estimated value functions than those of assigning only SER and only BUP.
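An inverse probability weighted value estimate of this kind can be sketched as follows. This is a simple Horvitz-Thompson style version with a known randomization probability; the cited estimator may differ in normalization, and all data and coefficients below are simulated stand-ins.

```python
import numpy as np

# IPW value of a linear regime d(x) = 1{x^T beta > 0}: average the outcomes of
# subjects whose observed treatment agrees with the regime, weighted by the
# inverse probability of receiving that treatment.
def ipw_value(Y, A, X, beta, pi=0.5):
    d = (X @ beta > 0).astype(int)            # recommended treatment per subject
    prob = np.where(A == 1, pi, 1 - pi)       # probability of the observed arm
    agree = (A == d).astype(float)
    return float(np.mean(agree * Y / prob))

rng = np.random.default_rng(4)
n = 500
X = rng.normal(size=(n, 3))
A = rng.binomial(1, 0.5, size=n)
beta = np.array([1.0, 0.0, -1.0])
Y = X[:, 0] + A * (X @ beta) + rng.normal(size=n)
# The regime aligned with the true interaction should outperform its reverse.
print(ipw_value(Y, A, X, beta) > ipw_value(Y, A, X, -beta))
```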

Discussion
In this paper, we have considered randomized studies with a pre-specified propensity score. The proposed method can be extended to observational studies with high-dimensional covariates, where the propensity score model needs to be correctly specified and estimated from data. Related work on propensity score estimation can be found in [16]. However, the derivation of the limiting distribution of our de-sparsified estimator for the treatment-covariate interaction coefficients would be much more involved and requires further investigation.

A.1. Construction of Θ̂
We apply the lasso for nodewise regression to obtain a matrix Θ̂ such that Θ̂Σ̂ is close to I, as in [9]. Let X_{−j} denote the matrix obtained by removing the jth column of D(A, π)X̃. For each j = 1,…,p + 1, let γ̂_j be the lasso solution of regressing the jth column of D(A, π)X̃ on X_{−j}, with components γ̂_{j,k}, k = 1,…,p + 1, k ≠ j. Further, define Θ̂ from the γ̂_j's and the corresponding residual scales, as in [9].
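The nodewise construction can be sketched as follows. A tiny coordinate-descent Lasso stands in for a production solver, and the residual-scale convention follows the common variant τ̂_j² = Z_j^T(Z_j − Z_{−j}γ̂_j)/n; the paper's exact normalization may differ.

```python
import numpy as np

# Coordinate-descent Lasso for (1/2n)||y - Xb||^2 + lam * ||b||_1.
def lasso_cd(X, y, lam, n_iter=200):
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for k in range(p):
            r = y - X @ beta + X[:, k] * beta[k]       # partial residual
            rho = X[:, k] @ r / n
            beta[k] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_ss[k]
    return beta

# Nodewise regression: regress each column Z_j on the rest, collect (1, -gamma_j)
# rows into C, and scale by the residual variances tau_j^2.
def nodewise_theta(Z, lam):
    n, m = Z.shape
    C = np.eye(m)
    tau2 = np.empty(m)
    for j in range(m):
        idx = [k for k in range(m) if k != j]
        gam = lasso_cd(Z[:, idx], Z[:, j], lam)
        C[j, idx] = -gam
        resid = Z[:, j] - Z[:, idx] @ gam
        tau2[j] = resid @ Z[:, j] / n                  # tau_j^2 convention
    return C / tau2[:, None]                           # Theta_hat = diag(tau^-2) C

rng = np.random.default_rng(5)
Z = rng.normal(size=(100, 5))
Theta = nodewise_theta(Z, lam=0.1)
# Theta_hat * Sigma_hat should be close to the identity matrix.
print(np.allclose(Theta @ (Z.T @ Z / 100), np.eye(5), atol=0.25))
```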

A.2. A preliminary lemma
We define the quantities in (12) and state Lemma A.1, which controls its three terms.

Proof of Lemma A.1: For the first term of (12), it is clear by Condition 1 that it has mean zero. Further, by the moment inequality in Chapter 6.2.2 of [1], where the second inequality is by Conditions 1 and 3, and then by Markov's inequality,

‖2ε^T D(A, π)X̃/n‖_∞ = O_p(√(log(p)/n)).   (13)

Next, consider the second term of (12). Since E(D(A, π)X̃) = 0, similar arguments as those leading to (13), combined with Conditions 2 (b) and 3, give (14). Finally, Condition 2 (a) bounds the third term of (12), giving (15). Combining (13)-(15) gives the claim in Lemma A.1.
The first term of the right-hand side is handled by Lemma A.1, which gives the order of the term involving 2ε^T and μ̂_π. The rest of the proof is similar to the proof of the second claim of Lemma 2 in [2], where it is shown that, given Conditions 3-5, the compatibility condition holds with probability tending to 1. Then, using Lemma A.1 and similar arguments in Section 6.2.2 of [1], the inequality in (5) holds.
The first term above equals 0 because E(ε | X, A) = 0 by Condition 1. The second term also equals 0 since E(D(A, π)X̃) = 0.
On the other hand, Condition 7 combined with similar arguments as in the proof of the first claim of Lemma 2 in [2] gives ‖Θ̂_j − Θ_j‖_1 = o_p(1/√(log p)).
On the other hand, the construction of β̂ implies the following inequality, where the second inequality is by (17). Therefore, by the uniqueness of the solution of the convex optimization problem, β̂ = β̂*.
Secondly, we show (18). Given that μ_π(X_i) and (A_i − π)β_0^T X̃_i are orthogonal and E(ε_i X̃_i) = 0 from Condition 1, (19) holds. By the second claim of Lemma 2 of [2], (19) combined with Conditions 3-5 and 9 (a) gives (18).
Next, it is easy to see that Condition 2 (a) is implied by (18) and Condition 3, and that Condition 2 (b) holds trivially given Condition 9 (b). Therefore, Condition 2 is satisfied and the rest of the proof is the same as the proof of Theorem 2.4.

Figure 1. The 95% confidence intervals of the top 14 interaction effects.
Figure 2. Boxplot of the estimated value functions from cross-validation testing samples. 'optimal' is the estimated optimal treatment regime; 'all.BUP' is the regime of assigning only BUP; 'all.SER' is the regime of assigning only SER.
Table 1. Coverage of confidence interval and mean absolute bias (MAB) of b with p = 300 and s = 5. Standard errors are in parentheses.

Table 2. Coverage of confidence interval and mean absolute bias (MAB) of b with p = 300 and s = 10 (panels for intensity 1 and intensity 1.5).
Table 3. Top-ranked p-values and the corresponding covariates that are most relevant for making treatment decisions.