Subgroup Identification Using the personalized Package

A plethora of disparate statistical methods have been proposed for subgroup identification to help tailor treatment decisions for patients. However, a majority of them do not have corresponding R packages, and the few that do pertain to particular statistical methods or provide little means of evaluating whether meaningful subgroups have been found. Recently, the work of Chen, Tian, Cai, and Yu (2017) unified many of these subgroup identification methods into one general, consistent framework. The goal of the personalized package is to provide a corresponding unified software framework for subgroup identification analyses that provides not only estimation of subgroups, but also evaluation of treatment effects within estimated subgroups. The personalized package allows for a variety of subgroup identification methods for many types of outcomes commonly encountered in medical settings. The package is built to incorporate the entire subgroup identification analysis pipeline, including propensity score diagnostics, subgroup estimation, analysis of the treatment effects within subgroups, and evaluation of identified subgroups. In this framework, different methods can be accessed with little change in the analysis code. Similarly, new methods can easily be incorporated into the package. Besides familiar statistical models, the package also allows flexible machine learning tools to be leveraged in subgroup identification. Further estimation improvements can be obtained via efficiency augmentation.


Introduction
Many studies of medical interventions, especially clinical trials, focus on population average treatment effects. However, it is widely recognized that the effects of treatments can differ substantially across a population. With increasing interest in improving the efficacy and effectiveness of health care, there has been a significant effort in the statistics community to develop methodology for optimal allocation of treatments to patients. Optimal treatment allocation can be thought of as a subgroup identification task, where subgroups are determined based on the heterogeneity of treatment effect. Heterogeneity of treatment effect can be characterized by the interaction of the treatment with patient characteristics. Thus, the goal in subgroup identification is to characterize and estimate these interactions in order to construct an optimal mapping from patient characteristics to a treatment assignment. This mapping is called an individualized treatment rule (ITR). An optimal ITR is one that, when enacted on a population, results in the largest expected patient outcome, assuming without loss of generality that larger outcomes are preferred. The overall patient outcome is impacted by both the main effects of patient characteristics and the treatment-covariate interactions, and thus many approaches, such as Qian and Murphy (2011), focus on modeling this full relationship to estimate ITRs. In their work, Qian and Murphy (2011) show robustness properties to model misspecification under certain conditions. Many recent works have instead focused on methods which do not require correct specification of the entire relationship between patient characteristics, treatment, and outcome, but only the parts relevant to the optimal ITR, and are thus often more robust to modeling choices.
Regardless of the general modeling approach, a vast majority of methods for ITR estimation do not have corresponding R packages, and those that do often pertain to particular statistical methods for optimal ITR estimation (Huang, Sun, Chatterjee, and Trow 2017; Riviere 2021; Dusseldorp, Doove, Van de Put, Mechelen, and Claramunt Gonzalez 2020; der Elst, Alonso, and Molenberghs 2020; Holloway, Laber, Linn, Zhang, Davidian, and Tsiatis 2020; Egami, Ratkovic, and Imai 2019). In addition, there has been much focus on estimation of subgroups based on patient characteristics, yet not enough emphasis on evaluation of the treatment effects within the resulting estimated subgroups, which is an equally important but challenging aspect of any subgroup analysis. Chen et al. (2017) revealed that a wide range of existing statistical methods for optimal ITR estimation fall under the umbrella of a unified estimation framework. This unified framework focuses on the estimation of treatment scores, which rank patients based on their individualized treatment effect. The scoring system encompasses optimal ITR estimation in the sense that a threshold for the treatment score can be used as a treatment assignment mechanism. The personalized package (Huling 2021) aims to be a versatile tool in the R statistical language (R Core Team 2021) for optimal ITR estimation and treatment scoring corresponding to the framework of Chen et al. (2017). Further, two valid approaches for estimation of treatment effects within the estimated subgroups are provided and can be used straightforwardly with any of the available methods for estimation of treatment scores, enabling validation of fitted subgroup identification models. The personalized package offers an entire subgroup analysis workflow with an intuitive and easy-to-use structure. Thus, a wide range of subgroup identification methods can be accessed with little change in the analysis workflow of the user.
Furthermore, the subgroup identification framework allows the practitioner to conduct a subgroup identification analysis using familiar statistical modeling concepts. The package is designed to accommodate a wide range of subgroup identification and treatment decision-making analyses.
The features of the personalized package include:
1. A wide range of loss function-based subgroup identification methods.
2. Modeling options for continuous, binary, count, and time-to-event outcomes.
3. Accommodation of observational studies via either propensity score-based analysis or matching.
4. Handling of both binary and multiple treatment scenarios.
5. Efficiency improvements through loss augmentation.
6. Evaluation of estimated subgroups with correction for overfitting.
7. Options for utilizing custom loss functions.
The package is available on the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=personalized in addition to a development version available as a GitHub repository at https://github.com/jaredhuling/personalized.
In Section 2 we provide background on the methodological underpinnings of package personalized, followed by a detailed description of the package itself in Section 3. In Section 4 we evaluate the various methods offered in the personalized package in a numerical comparison with several methods from other packages. Finally, in Section 5 we demonstrate the use of the package on a subgroup identification analysis of the National Supported Work Study (LaLonde 1986) and conclude with some discussion in Section 6.

Individualized treatment effects, benefit scores, and individualized treatment rules
In this section we provide a formal overview of the subgroup identification framework of Chen et al. (2017). We first consider binary treatments and then provide extensions to multi-category treatments in a later section. Let the treatment assignment be denoted as T ∈ T = {−1, 1}, where T = 1 indicates that a patient received the treatment and T = −1 indicates that a patient received the control. We also observe the patient outcome Y, where larger values are assumed to be preferable without loss of generality. We further observe a length-p vector of patient covariate information X ∈ X. Note that the first element of X is an intercept term. The covariates may modify the effect of T on Y, resulting in treatment effect heterogeneity. Relating the above to observable quantities, we observe data from n patients {(Y_i, T_i, X_i), i = 1, . . . , n} consisting of n independent, identically distributed copies of (Y, T, X). In identifying subgroups of patients who may benefit from T differently, we are often interested in estimating the contrast function
∆(X) ≡ E(Y | T = 1, X) − E(Y | T = −1, X). (1)
Note that (1) involves a difference of means. There can be cases where a ratio,
Γ(X) ≡ E(Y | T = 1, X) / E(Y | T = −1, X), (2)
may instead be a more interpretable or relevant estimand, especially for positive Y. Treatment effect heterogeneity is clearly reflected only through either ∆(X) or Γ(X). To connect these quantities to causal effects, we adopt the potential outcomes framework (Rubin 2005): let Y^(1) and Y^(−1) denote the potential outcomes under treatment and control, respectively, and assume the stable unit treatment value assumption (SUTVA), under which the outcome of a unit when exposed to a treatment will be the same no matter what mechanism is used to assign the treatment. We also assume "strong ignorability" (Rosenbaum and Rubin 1983; Rubin 2005), that is, T ⊥⊥ (Y^(1), Y^(−1)) | X. Violations of SUTVA can occur when there are spillover effects from treated units to other units; however, in this paper we always assume SUTVA holds.
We assume that the treatment assignment mechanism is either known, as is the case in randomized controlled trials, or is unknown and can be estimated, as is the case when there are no unmeasured confounders. In other words, π(X) = P(T = 1 | X) is either a known function or can be consistently estimated via regression modeling. Further, a "positivity" assumption that all patients have a chance of receiving the treatment, i.e., 0 < π(x) < 1 for all x ∈ X, is required. Under these assumptions, ∆(X) = E(Y^(1) | X) − E(Y^(−1) | X) and Γ(X) = E(Y^(1) | X) / E(Y^(−1) | X) are treatment effects conditional on patient characteristics. Note that in the potential outcome notation, the value function of a decision rule d is V(d) = E[Y^(d)]. Many matching strategies (Imbens and Rubin 2015) can also be used instead of direct modeling of π(X) = P(T = 1 | X). However, note that under matching, the targeted estimand can depend on the matching mechanism. For example, if matching is based on the treated subjects, then the estimand is the treatment effect on the treated conditional on patient characteristics, i.e., ∆_1(X) = E(Y^(1) | X, T = 1) − E(Y^(−1) | X, T = 1) or Γ_1(X) = E(Y^(1) | X, T = 1) / E(Y^(−1) | X, T = 1).

Subgroup identification and benefit score estimation via loss functions
The framework of Chen et al. (2017) covers two classes of benefit score estimators. The two methods, called the weighting method and the advantage-learning (A-learning) method, are both quite general approaches for estimating ∆(X) or Γ(X) (or transformations thereof) via loss functions. Neither the weighting method nor the A-learning method requires specification of the full outcome regression model; both focus on direct estimation of ∆(X), Γ(X), or transformations thereof. As we will explore in later sections, outcome regression models can, however, be incorporated into both the weighting and A-learning methods in order to improve efficiency. A major benefit of both methods is that even when full outcome regression models are utilized, misspecification of the full outcome regression model does not impact the validity of the resulting estimators.
Consider a convex loss function M(y, v) used for the purpose of estimating benefit scores. A useful example is the squared error loss, M(y, v) = (y − v)^2. In their original work, Chen et al. (2017) require M(y, v) to meet the following conditions: i) M_v(y, v) ≡ ∂M(y, v)/∂v is increasing in v for every fixed y, and ii) M_v(y, 0) is monotone in y. These requirements are sufficient for Fisher-consistent subgroup identification; however, they are not necessary. Conditions i) and ii) can be relaxed to incorporate a wider class of losses such as the hinge loss M(y, v) = y max(1 − v, 0). In Section 2.4, we point out that the conditions specified by Chen et al. (2017) on M for the multi-category treatment setting can also be relaxed.

Weighting method
The first estimation method is called the weighting method. Given a sample of n patients, the weighting method estimates ∆(X) or Γ(X) (or transformations thereof) by minimizing the following objective function with respect to f(X):
L_W(f) = (1/n) Σ_{i=1}^n M(Y_i, T_i f(X_i)) / {T_i π(X_i) + (1 − T_i)/2}, (3)
where W indicates the weighting method and π(x) = P(T = 1 | X = x) is the propensity score function; the weight equals 1/π(X_i) for treated patients and 1/{1 − π(X_i)} for controls. The weighting estimator is then f̂_W = argmin_f L_W(f). The corresponding population-level weighting estimator is the minimizer of
E[ M(Y, T f(X)) / {T π(X) + (1 − T)/2} ] (4)
with respect to f, where W again indicates the weighting method. The weighting method is valid without specification of the full outcome regression model, as the inverse weights result in the interactions T × f(X) being uncorrelated with the main effects g(X). The estimated benefit score under the weighting method, f̂_W, can be used to estimate ∆(X) under many different loss functions. See Table 1 for examples.

A-learning method
The A-learning estimator involves minimizing
L_A(f) = (1/n) Σ_{i=1}^n M(Y_i, {(T_i + 1)/2 − π(X_i)} f(X_i)), (5)
where A indicates the A-learning method and (T_i + 1)/2 = I(T_i = 1). The A-learning estimator is then f̂_A = argmin_f L_A(f). The A-learning method works without specification of the full regression model, because the centered interaction {(T + 1)/2 − π(X)} × f(X) is uncorrelated with, and in fact orthogonal to, the main effects g(X). This property follows from the fact that E[(T + 1)/2 − π(X) | X] is zero. The corresponding population-level A-learning estimator is the minimizer of
E[ M(Y, {(T + 1)/2 − π(X)} f(X)) ] (6)
with respect to f, where again A indicates the A-learning method and (T + 1)/2 = I(T = 1).

Benefit score properties and estimands
Although f̂_W(·) and f̂_A(·) are not always themselves estimates of ∆(·), the zero point of f̂_W(·) and f̂_A(·) is always meaningful and can be used as a threshold for determining subgroups. For example, assuming that larger outcomes are preferred, all patients with covariates x such that f̂_W(x) > 0 should have better outcomes under the treatment than under the control on average. More formally, denote the population-level minimizers of the weighting and A-learning objectives as f_{W0} and f_{A0}, respectively. Under both the weighting and A-learning methods and conditions i) and ii) on M from above, for patients with a negative benefit score (f_{A0}(x) < 0 or f_{W0}(x) < 0), we have E{U(Y^(1)) | X = x} > E{U(Y^(−1)) | X = x}, where U(y) = ∂M(y, v)/∂v|_{v=0}, and for those with a positive benefit score, we have E{U(Y^(1)) | X = x} < E{U(Y^(−1)) | X = x}. Thus, d_{W0}(x) = sign(f_{W0}(x)) and d_{A0}(x) = sign(f_{A0}(x)) are optimal decision rules for mapping patient characteristics X to treatment decision T. Note that treatment assignment decisions here, while based on the overall cutoff value of 0, are determined by the individual treatment effects. It was shown in Chen et al. (2017) that both the weighting and A-learning methods result in Fisher-consistent treatment decision rules under a wide class of outcome types and M functions. Hence, the estimated benefit scores can be used to optimally assign patients to treatment groups. For non-differentiable losses such as the hinge loss, similar arguments can be made.
Furthermore, the benefit scores themselves can reflect the magnitude of the individual treatment effect and thus can be used for ranking patients by how effective the treatment is. Other choices of M(y, v) lead to different interpretations; see Table 1 for more examples of the relationship between f_{W0}, f_{A0} and ∆(X) or Γ(X). As pointed out in Chen et al. (2017), similar to using surrogate loss functions in a classification setting (Bartlett, Jordan, and McAuliffe 2006), the final form of the solution, f_{W0} or f_{A0}, depends on the choice of the loss function. However, not all choices of M(y, v) lead to such direct interpretation. For example, the hinge loss M(y, v) = y max(0, 1 − v) does not seem to have a direct link with ∆(X), though the zero point of both f_{W0} and f_{A0} under the hinge loss is still meaningful. The personalized package offers estimation under more losses than are listed in Table 1; however, the additional losses lead to less interpretable estimates. A listing of all losses implemented in the personalized package is available in Table 2.
In scenarios where there are limited resources to allocate treatments, it may be of interest to find a smaller subgroup of patients to recommend the treatment to than the subgroup resulting from patients with f_{W0} > 0. Since f_{W0} and f_{A0} rank patients by magnitude of treatment effect under most losses, using f_{W0} > c with c > 0 yields a smaller subgroup with a larger treatment effect than f_{W0} > 0. Thus, given limited resources for treatment allocation, using f_{W0} > c can be useful for finding a small subgroup of patients for whom the treatment is highly beneficial.

Loss function choices and relationship with other methods
Although many of the loss functions in Table 1 are familiar from standard regression settings, each combination of the A-learning or weighting methods with a valid loss function results in a different estimator, allowing for a high degree of versatility in estimation. For example, M(y, v) = y log{1 + exp(−v)} corresponds to the method developed in Xu, Yu, Zhao, Li, Wang, and Shao (2015), and M(y, v) = y max(1 − v, 0) corresponds to the outcome weighted learning (OWL) method of Zhao, Zeng, Rush, and Kosorok (2012). Under a randomized clinical trial setting with treatments assigned with equal probability, both the A-learning and weighting methods with M(y, v) = (y − v)^2 reduce to the modified covariate method of Tian et al. (2014) for continuous responses, and the A-learning method with M(y, v) = (y − v)^2 corresponds to the approach of Lu, Zhang, and Zeng (2013) and Ciarleglio, Petkova, Ogden, and Tarpey (2015), among others. Using the A-learning method with the squared error loss and loss augmentation (described below in Section 2.5) is equivalent, after accounting for any variable selection penalties, to the estimation method utilized in Zhao, Small, and Ertefaie (2017), Shi, Song, and Lu (2016), Shi, Fan, Song, Lu et al. (2018), and the method behind the de-sparsified estimator of Jeng, Lu, Peng et al. (2018).
[Notes to Table 1: censoring rates are assumed to be equal within treatment arms; Λ*_W(t) is a monotone increasing function described in the Appendix of Tian et al. (2014), and Λ*_A(t) is quite similar to Λ*_W(t); Γ*_W(X) corresponds to the estimand of the weighting method and Γ*_A(X) to that of the A-learning method. Under randomization into treatment and control groups with equal probability, the forms for Γ(X) or ∆(X) for the A-learning method simplify dramatically; for example, under equal randomization and M(y, v) = y log(1 + exp{−v}), Γ(X) = exp{f_{A0}(X)/2}.]

Modeling choices for the benefit score
Modeling choices must be made for the form of f_W and f_A. One can use a simple form of f such as a linear combination of the covariates, i.e., f(X) = X^⊤β, in which case f̂(X) = X^⊤β̂. Such a choice leads to interpretable models: for most loss functions, if the effect β_j of variable j is positive, then increased values of variable j lead to an increase in treatment benefit, and negative effects lead to decreased benefit. Beyond linear forms, regression trees, smoothing splines, or other nonparametric and flexible approaches may be used for f_W or f_A.

Loss function example and implementation details
One of the key benefits of the framework of Chen et al. (2017) is the relative ease of implementation. Many combinations of method (weighting or A-learning) and loss function can be computed with existing regression software. Either (3) or (5) can be minimized for a given loss by providing the existing software a modified design matrix, provided the software can accept observation weights.
To see how this is accomplished, consider the familiar example of the squared error loss function M(y, v) = (y − v)^2. Under this loss and the assumption that ∆(X) = X^⊤β, we can minimize (3) using existing software. First, denote the n × p design matrix of patient covariate information to be X. Denote a modified design matrix X̃ to be diag(T)X, where T = (T_1, . . . , T_n)^⊤ and diag(T) is the diagonal matrix with diagonal elements as the elements of the vector T. Then the minimizer of (3) is simply the weighted least squares estimator
β̂ = (X̃^⊤WX̃)^{−1} X̃^⊤WY,
where W is the diagonal weight matrix with i-th diagonal element 1/(T_i π(x_i) + (1 − T_i)/2) and Y = (Y_1, . . . , Y_n)^⊤. If X is high dimensional and variable selection is desired, X̃ and Y along with a vector of observation weights can be supplied to existing software, such as the glmnet R package (Friedman, Hastie, Tibshirani, Narasimhan, Tay, and Simon 2021). More details on how this is handled in the personalized package are provided in Section 3.
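As a concrete illustration, the weighted least squares computation above can be carried out with base R alone. This is a hedged sketch on simulated data of our own (the variable names and data-generating model are not from the paper); the personalized package performs the analogous steps internally.

```r
set.seed(1)
n <- 200; p <- 3
x <- matrix(rnorm(n * p), n, p)
trt <- 2 * rbinom(n, 1, 0.5) - 1          # T in {-1, 1}, equal randomization
y <- 0.5 * x[, 1] + trt * (1 + x[, 2]) + rnorm(n)

pi.x <- rep(0.5, n)                        # known propensity scores
w <- 1 / (trt * pi.x + (1 - trt) / 2)      # 1/pi(x) if T = 1, 1/(1 - pi(x)) if T = -1
x.mod <- trt * cbind(1, x)                 # modified design matrix diag(T) %*% [1, X]

# weighted least squares on the modified design matrix estimates the
# benefit score coefficients
fit <- lm(y ~ x.mod - 1, weights = w)
benefit.score <- drop(cbind(1, x) %*% coef(fit))
```

Patients with a positive value of benefit.score would be recommended the treatment. For high-dimensional x, the matrix x.mod, the outcome y, and the weights can instead be passed to glmnet::glmnet() for penalized estimation.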
More generally, existing software can be used to minimize (3) and (5) by appropriate construction of weights and modified design matrices. The modified design matrix for the weighting method is defined as in the example above, and the modified design matrix for the A-learning method is defined as X̃ = diag{(T + 1)/2 − π(X)}X, where π(X) is a vector with i-th element equal to π(x_i).

Extension to multi-category treatments
Often, more than two treatments are available for patients, and the researcher may wish to understand which of all treatment options is best for which patients. Extending the above methodology to multi-category treatments results in added complications; in particular, there is no straightforward extension of the A-learning method to multiple treatment settings. In the supplementary materials of Chen et al. (2017), the weighting method was extended to estimate a benefit score corresponding to each level of the treatment, subject to a sum-to-zero constraint for identifiability. In particular, with treatments T ∈ {1, . . . , K}, we are interested in estimating (the sign of)
∆_{kl}(x) ≡ E(Y | T = k, X = x) − E(Y | T = l, X = x). (7)
If ∆_{kl}(x) > 0, then treatment k is preferable to treatment l for a patient with X = x. For each patient, evaluation of all pairwise comparisons of the ∆_{kl}(x) indicates which treatment leads to the largest expected outcome. The weighting estimators of the benefit scores f_1, . . . , f_K are the minimizers of the following loss function:
(1/n) Σ_{i=1}^n M(Y_i, f_{T_i}(X_i)) / P(T = T_i | X = X_i), (8)
subject to Σ_{k=1}^K f_k(X_i) = 0. Clearly when K = 2, this loss function is equivalent to (3). Estimation of the benefit scores in this model is still challenging without added modeling assumptions, as enforcing Σ_{k=1}^K f_k(X_i) = 0 may not always be feasible using existing estimation routines. However, if each ∆_{kl}(X) has a linear form, i.e., ∆_{kl}(X) = X^⊤β_k, where l represents a reference treatment group, estimation can easily be fit into the same computational framework as for the simpler two-treatment case by constructing an appropriate design matrix. Thus, for multiple treatments the personalized package is restricted to linear estimators of the benefit scores. For instructive purposes, consider a scenario with three treatment options, A, B, and C. Let X = (X_A^⊤, X_B^⊤, X_C^⊤)^⊤ be the design matrix for all patients, where each X_k is the sub-design matrix of patients who received treatment k.
Under ∆_{kl}(X) = X^⊤β_k with l as the reference treatment, we can construct a new design matrix which can then be provided to existing estimation routines in order to minimize (8). With treatment C as the reference treatment, the design matrix is constructed using an indicator vector J, where the i-th element of J is 2I(T_i = C) − 1, and the weight vector W is constructed with the i-th element set to 1/P(T = T_i | X = X_i). Furthermore, denote β = (β_A^⊤, β_B^⊤)^⊤. Hence f_C(X) = −{f_A(X) + f_B(X)}, and thus the sum-to-zero constraints on the benefit scores hold by construction.
The conditions specified for the loss functions in the supplementary material of Chen et al. (2017) are more restrictive than necessary. In general, we find that all Fisher-consistent margin-based classification loss functions (Bartlett et al. 2006; Tewari and Bartlett 2007; Zou, Zhu, and Hastie 2008) can be adopted for the weighting method in the multi-category treatment setting. The sum-to-zero constraint can be handled elegantly by mimicking the angle-based reformulation proposed by Zhang and Liu (2014) in the classification setting; however, this approach has not yet been adopted in the personalized package.

Efficiency improvement via loss function augmentation
As mentioned in previous sections, the weighting and A-learning approaches do not require the specification of a full outcome regression model to consistently recover optimal subgroups. However, gains in efficiency can be made through augmentation of the loss function. Loss augmentation involves positing and fitting a full outcome regression model and using the predictions from this model to construct a modified loss function. This approach has two benefits: first, if the full outcome regression model is indeed correctly specified, then the resulting estimator will be efficient; second, if the outcome regression model is incorrectly specified, it does not impact the Fisher consistency of the estimated subgroups. Further, in practice, even when the outcome regression model is incorrect, efficiency gains are often realized.
The basic approach to augmentation is the following:
1. Posit and fit a full outcome regression model and obtain predictions pred(X_i, T) under each treatment option T ∈ T.
2. Create the augmentation function by averaging the predictions over both treatment options: a(X_i) = Σ_{T ∈ T} a_T pred(X_i, T), where the a_T are weights. In practice, a simple average with a_T = 1/2 works well.
3. Construct an augmented loss function M̃(y, v) = M(y, v) + g(a(X), v), where g(y, v) meets the same conditions required of M(y, v).
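For the squared error loss, these augmentation steps amount to centering the outcome by a(X) before the weighted fit. The following base-R sketch uses simulated data and a simple working outcome model of our own choosing (any reasonable outcome model could be substituted):

```r
set.seed(2)
n <- 300
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$trt <- 2 * rbinom(n, 1, 0.5) - 1
df$y <- df$x1 + df$trt * df$x2 + rnorm(n)

# Step 1: posit and fit a full outcome regression model (a working model)
out.mod <- lm(y ~ (x1 + x2) * trt, data = df)

# Step 2: average the predictions over both treatment arms (a_T = 1/2)
a.x <- (predict(out.mod, transform(df, trt = 1)) +
        predict(out.mod, transform(df, trt = -1))) / 2

# Step 3: augmented weighted fit on the centered outcome y - a(X)
w <- rep(2, n)                      # 1/0.5 under equal randomization
x.mod <- df$trt * cbind(1, df$x1, df$x2)
y.centered <- df$y - a.x
fit.aug <- lm(y.centered ~ x.mod - 1, weights = w)
```

Even when out.mod is misspecified, the augmented fit remains Fisher consistent; the augmentation only aims to reduce variance.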
The augmented loss function does not change the optimality of the resulting decision rules, and thus allows for potential reductions in variance. The most efficient augmentation function within the class of augmentation functions of the multiplicative form v g(X) was derived in the supplementary material of Chen et al. (2017). However, in the personalized package, we consider a more limited class of augmentation functions g*(X) = g(a(X)) that allows for simpler implementation using the functionality for offsets provided in existing software.

Validating subgroups via subgroup-conditional treatment effects
A subgroup analysis under the framework of Chen et al. (2017) involves estimating a set of benefit scores f̂(X_i), i = 1, . . . , n, and then using the benefit scores to construct subgroups, i.e., a subgroup of patients for whom the treatment is "recommended" via f̂(X_i) > 0 and a subgroup of patients for whom the treatment is not recommended, f̂(X_i) ≤ 0. Upon using the data for this purpose, some natural questions to ask are "what is the effect of the treatment among those with f̂(X_i) > 0?", "is the effect of the treatment among those with f̂(X_i) > 0 meaningfully different from zero?", and "what would be the improvement in outcomes over the population of interest if all patients followed the recommendations of f̂(X_i)?". Such questions are nontrivial to answer using the same n samples used to construct the subgroups. The subgroups themselves are conditional on the observed outcomes; thus, simply evaluating treatment effects within subgroups will yield biased and overly optimistic estimates of the subgroup-conditional treatment effects. See Athey and Imbens (2016) and Qiu, Zeng, and Wang (2018) for more discussion.
Denote the decision rule under benefit scores f̂ as d(X) = sign{f̂(X)}. Then the overall benefit of the treatment assignment rule d over the population is
δ(d) ≡ E{Y | T = d(X)} − E{Y | T = −d(X)}.
Further define the subgroup-conditional treatment effects as
δ_1(d) ≡ E{Y | T = 1, d(X) = 1} − E{Y | T = −1, d(X) = 1} and
δ_{−1}(d) ≡ E{Y | T = −1, d(X) = −1} − E{Y | T = 1, d(X) = −1}.
The quantities δ_1(d) and δ_{−1}(d) may also be of interest in their own right. Directly calculating empirical versions δ̂(d), δ̂_1(d), and δ̂_{−1}(d) of the above quantities, by replacing the conditional expectations with sample averages of the outcomes within the corresponding treatment-by-subgroup cells of the same data used to construct d, will yield biased estimates. Thus, alternative approaches are needed to estimate δ(d), δ_1(d), and δ_{−1}(d). Another potentially interesting statistic to measure the benefit of subgroup recommendations is the C-for-benefit statistic of Van Klaveren, Steyerberg, Serruys, and Kent (2018); however, this is not used in the personalized package.

Bootstrap bias correction
The first approach used in the personalized package to estimate δ(d), δ_1(d), and δ_{−1}(d) is the bootstrap bias correction approach of Harrell, Lee, and Mark (1996). The bootstrap bias correction approach seeks to estimate the bias in the estimates of the subgroup treatment effects that arises from using the same data to estimate these effects as was used to construct the subgroups, and then corrects for this bias. This bootstrap bias correction method was introduced in Harrell et al. (1996) and later used in Foster, Taylor, and Ruberg (2011) for the purpose of evaluating subgroup effectiveness. For any statistic S, let S_full(X) be the statistic estimated with the full training data {(Y_i, T_i, X_i), i = 1, . . . , n} and evaluated on data X, and let S_b(X) be the statistic estimated using the b-th bootstrap sample and evaluated on X. The general outline of the bootstrap bias correction method is as follows: • Construct B bootstrap samples of size n with replacement. For the b-th bootstrap sample, calculate the statistics S_b(X) and S_b(X_b).
• The bootstrap estimate of the amount of bias with regard to the statistic S is
B̂ias(S) = (1/B) Σ_{b=1}^B {S_b(X_b) − S_b(X)}.
• The bias-corrected estimate of the statistic S is then calculated as
S_full(X) − B̂ias(S).
The term S_b(X_b) involves evaluating the statistic S on the same data as was used to construct the underlying estimator. The term S_b(X) involves evaluating S on the original dataset, which acts like an external dataset. Thus, S_b(X_b) − S_b(X) mimics the bias that arises from evaluating a statistic on the same dataset that was used to construct the statistic/estimator.
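The outline above can be written as a short generic routine. This is our own illustrative implementation, not code from the package; fit.fn and stat.fn are hypothetical placeholders for the subgroup model fit and the statistic of interest:

```r
boot.bias.correct <- function(data, fit.fn, stat.fn, B = 100) {
  fit.full <- fit.fn(data)
  s.full <- stat.fn(fit.full, data)   # apparent (in-sample) estimate of S
  bias <- numeric(B)
  for (b in seq_len(B)) {
    boot.data <- data[sample(nrow(data), replace = TRUE), , drop = FALSE]
    fit.b <- fit.fn(boot.data)
    # optimism: statistic on the bootstrap sample minus on the original data
    bias[b] <- stat.fn(fit.b, boot.data) - stat.fn(fit.b, data)
  }
  s.full - mean(bias)                 # bias-corrected estimate of S
}
```

With fit.fn fitting the subgroup identification model (including the propensity score model) and stat.fn computing a subgroup-conditional treatment effect, the returned value is the bias-corrected effect estimate.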

Repeated training/testing splitting
The training and testing splitting approach we outline in this section is similar to the sample-splitting scheme of Qiu et al. (2018); however, their approach is based on a K-fold type procedure, whereas ours is based on repeated splitting of the data. The personalized package has functionality for using the training/testing splitting procedure to estimate δ(d), δ_1(d), and δ_{−1}(d). The procedure involves repeatedly randomly partitioning the data into a training portion and a testing portion. Each replication partitions τ × 100% of the data into the training data and (1 − τ) × 100% into the testing data. Using similar notation as for the bootstrap bias correction approach, define for the b-th replication S_train,b(X) to be the statistic S constructed using the b-th training sample and evaluated on data X, and X_test,b to be the covariates from the b-th test data. The repeated training/testing splitting procedure is as follows: • Construct B random partitions of the data using training fraction τ, and for each b calculate S_train,b(X_test,b).
• The training/testing splitting estimate of statistic S is then
(1/B) Σ_{b=1}^B S_train,b(X_test,b).
Foster et al. (2011) explored a variety of approaches for estimating quantities similar to subgroup-conditional treatment effects and found bootstrap bias correction approaches to be the least biased and to have lower variability than cross-validation-based approaches; cross-validation approaches tended to underestimate the effects of interest. Thus, we advocate the use of the bootstrap bias correction approach; however, the training and testing splitting approach is appropriate as well and tends to give more conservative estimates of the subgroup-conditional treatment effects.
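A minimal version of the repeated training/testing splitting procedure can be sketched as follows; fit.fn and stat.fn are hypothetical placeholders of our own for the subgroup model fit and the statistic of interest:

```r
split.estimate <- function(data, fit.fn, stat.fn, B = 100, tau = 0.5) {
  n <- nrow(data)
  vals <- numeric(B)
  for (b in seq_len(B)) {
    train.idx <- sample(n, size = floor(tau * n))
    fit.b <- fit.fn(data[train.idx, , drop = FALSE])             # fit on training split
    vals[b] <- stat.fn(fit.b, data[-train.idx, , drop = FALSE])  # evaluate on test split
  }
  mean(vals)   # average over the B random partitions
}
```

Because the statistic is always evaluated on held-out data, the resulting estimates tend to be conservative relative to the bootstrap bias correction approach.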

The personalized package
In this section we provide detailed information about the personalized package and how it is utilized in subgroup identification analyses. We begin by providing an outline of the workflow of the personalized package. The remainder of this section roughly follows the order of this workflow and explains each function involved in each step. Along the way, key arguments of each of these functions are described in detail with usage examples intermixed. Finally, this section concludes with a demonstration of an entire subgroup identification analysis on a simulated dataset with a multi-category treatment.

Workflow of subgroup identification analysis
Regardless of the specific modeling choices, the workflow of subgroup identification analyses in the personalized package has the following four steps:
1. Construct the propensity score function and check propensity score diagnostics (function check.overlap()).
2. Fit a subgroup identification model using function fit.subgroup().
3. Estimate the resulting treatment effects among estimated subgroups using function validate.subgroup().
4. Visualize and examine the model (plot()), subgroup treatment effects (print()), and characteristics of the subgroups (summarize.subgroups()).
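In code, the four steps map onto package functions roughly as follows. This is a schematic sketch: x, y, and trt are assumed to be defined by the user, and the propensity model, loss, and options shown are just one possible set of choices.

```r
library(personalized)

# Step 1: propensity model wrapped as a function, plus overlap diagnostics
prop.func <- function(x, trt) {
  fitted(glm(I(trt == 1) ~ x, family = binomial()))
}
check.overlap(x = x, trt = trt, propensity.func = prop.func)

# Step 2: fit a subgroup identification model (weighting method, lasso)
subgrp.model <- fit.subgroup(x = x, y = y, trt = trt,
                             propensity.func = prop.func,
                             loss = "sq_loss_lasso",
                             method = "weighting", nfolds = 5)

# Step 3: estimate treatment effects within the estimated subgroups
validation <- validate.subgroup(subgrp.model, B = 100,
                                method = "boot_bias_correction")

# Step 4: inspect the model, subgroup effects, and subgroup characteristics
summary(subgrp.model)
plot(subgrp.model)
print(validation)
summarize.subgroups(subgrp.model)
```

Swapping in a different loss, method, or propensity model changes only the corresponding arguments; the rest of the workflow is unchanged.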
We will create a simulated dataset where we know the underlying data-generating mechanism, and we will use this dataset throughout this paper for illustration. In this simulation, the treatment assignment depends on covariates and hence we must model the propensity score π(x) = P(T = 1|X = x). We also assume that larger values of the outcome are better. We generate 1000 samples with 50 covariates and consider continuous, binary, and time-to-event outcomes. The covariates X in this dataset are generated from a normal distribution and are uncorrelated, the propensity function π(x) depends only on the 21st and 41st covariates, the optimal treatment rule depends on covariates 3, 11, 1, and 12, including linear terms and an interaction between covariates 1 and 12, and the main effects in the outcome regression model depend on covariates 1, 11, 12, 13, and 15 and include both nonlinear and linear terms.
Here we simulate the differential treatment effect.
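One possible realization of this simulation is sketched below. The structure (which covariates enter the propensity model and the differential effect) follows the description above, but the particular coefficient values and signs are our own illustrative choices, not prescribed by the text.

```r
R> set.seed(123)
R> n.obs <- 1000
R> n.vars <- 50
R> x <- matrix(rnorm(n.obs * n.vars, sd = 3), n.obs, n.vars)
R> # propensity depends only on covariates 21 and 41 (coefficients illustrative)
R> xbetat <- 0.5 + 0.25 * x[, 21] - 0.25 * x[, 41]
R> trt.prob <- exp(xbetat) / (1 + exp(xbetat))
R> trt <- rbinom(n.obs, size = 1, prob = trt.prob)
R> # differential treatment effect: linear terms in covariates 3, 11, 1, and 12
R> # plus an interaction between covariates 1 and 12
R> delta <- 0.5 + x[, 3] - x[, 11] + x[, 1] + 0.5 * x[, 1] * x[, 12]
```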
Now we simulate the main effects and add them to the differential treatment effect and then generate i) continuous, ii) binary, and iii) time-to-event outcomes.
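Continuing the sketch, assuming the objects x, trt, and the differential effect delta simulated as above, the main effects and the three outcome types might be generated as follows. The specific nonlinear forms and noise levels are illustrative stand-ins for the terms described in the text.

```r
R> # main effects in covariates 1, 11, 12, 13, and 15: nonlinear and linear terms
R> main.effects <- 0.5 * x[, 1]^2 + sin(x[, 11]) + x[, 12] + x[, 13] + x[, 15]
R> trt.coded <- 2 * trt - 1  # code treatment as +1 and control as -1
R> # i) continuous outcome
R> y <- main.effects + trt.coded * delta + rnorm(n.obs, sd = 2)
R> # ii) binary outcome
R> y.binary <- 1 * (y > median(y))
R> # iii) time-to-event outcome with independent censoring
R> surv.time <- exp(-0.25 * (main.effects + trt.coded * delta) + rnorm(n.obs))
R> cens.time <- exp(rnorm(n.obs, sd = 2))
R> y.time.to.event <- pmin(surv.time, cens.time)
R> status <- 1 * (surv.time <= cens.time)
```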

Observational studies
To deal with non-randomized treatment assignment in subgroup identification analysis for observational studies, we usually construct a model for the propensity score, which is the probability of treatment assignment conditional on observed baseline characteristics (Imbens and Rubin 2015). Our package also allows matched analysis for which this step or the check.overlap() function is not needed. In the personalized package, we need to wrap the propensity score model in a function which inputs covariate values and the treatment statuses and outputs propensity scores between 0 and 1. It is crucial for the personalized package to utilize the propensity score model as a function instead of simply a vector of probabilities. Later in this paper when we seek to evaluate the subgroup treatment effects, we must use either bootstrap resampling or repeated training and test splitting of our data. In both of these approaches we need to re-fit our subgroup identification model, including the propensity score model and hence it would be invalid to assume the propensity scores remain constant for each bootstrap iteration. A simple example of how one constructs their propensity score function is as follows.
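A minimal sketch of such a wrapper function, here using an unpenalized logistic regression via base R's glm() (a penalized version is shown later for the propensity.func argument), is:

```r
R> propensity.func <- function(x, trt) {
+   # fit a logistic regression model for Pr(T = 1 | X = x)
+   fit <- glm(trt ~ x, family = binomial())
+   # return the estimated propensity score for each observation
+   unname(fitted(fit))
+ }
```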
To assess the positivity assumption, propensity scores should be checked to ensure sufficient overlap between treatment groups; this is a requirement for valid use of propensity scores. The personalized package offers a visual aid for checking overlap via the check.overlap() function, which plots densities or histograms of the propensity scores for each of the treatment groups. To help assess whether there is sufficient overlap of the propensity score distributions between the treated and untreated, the user should check for regions near 0 or 1 where there is a positive density of propensity scores for the treatment group but not the control group, or for the control group but not the treatment group. Figure 1 illustrates that the data have reasonable overlap; however, there is a slight region near 0 with a positive density of propensity scores for the control group but no density of propensity scores for the treatment group. One may consider corrective measures to mitigate this. In the presence of insufficient overlap, techniques such as those proposed in Crump, Hotz, Imbens, and Mitnik (2009) may be utilized. Further discussion on identification of support overlap issues and approaches for mitigating these issues can be found in Caliendo and Kopeinig (2008) and Garrido, Kelley, Paris, Roza, Meier, Morrison, and Aldridge (2014). The following code generates Figure 1:
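Assuming a propensity function such as the one constructed above, the overlap check can be sketched as:

```r
R> library("personalized")
R> check.overlap(x = x, trt = trt, propensity.func = propensity.func)
```

By default, densities of the propensity scores are plotted for each treatment group; the type argument of check.overlap() also allows histograms.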

Randomized controlled trials
If the data to be analyzed come from a randomized controlled trial, it is still valid to construct a propensity score model as above, but it is not necessary. If the modeler knows that patients were randomized to the treatment group with probability 0.5, for example, the propensity function can simply be constructed as the following:

R> constant.propensity.func <- function(x, trt) 0.5

The fit.subgroup() function
The fit.subgroup() function is the main workhorse of the personalized package. It provides fitting capabilities for a wide range of subgroup identification models for different types of outcomes. We will first show a basic usage of the fit.subgroup() function and then provide detailed information about its arguments and more involved examples.
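A basic call, using the simulated continuous outcome together with the lasso-penalized squared error loss and the A-learning method discussed below (the argument values here are illustrative), might look like:

```r
R> subgrp.model <- fit.subgroup(x = x, y = y, trt = trt,
+   propensity.func = propensity.func.lasso,
+   loss = "sq_loss_lasso", method = "a_learning", nfolds = 10)
R> summary(subgrp.model)
```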
The above code fits a model with linear interaction terms, using the squared error loss as M(·, ·) with a lasso penalty for variable selection. The squared error loss is used because the outcome y is continuous. The A-learning method is used for demonstrative purposes; in this situation the weighting method could have been used as well. The nfolds argument is passed to the underlying fitting function cv.glmnet() from the glmnet package. The output provides some basic summary statistics of the resulting subgroups, benefit scores, and estimated conditional treatment effects for each sample, and shows the estimated interaction coefficients. The first estimate is the treatment main effect, which is always selected. Any other variables selected represent treatment-covariate interactions.

Explanation of major function arguments
x The argument x is for the design matrix. Each column of x corresponds to a variable to be used in the model for ∆(X) and each row of x corresponds to an observation. Every variable in x will be used for the subgroup identification model (however some variables may be removed if a variable selection procedure is specified for loss).
y The argument y is for the response vector. Each element in y is a patient observation. In the case of time-to-event outcomes y should be specified as a 'Surv' object of the survival package (Therneau 2021). For example the user should specify y = Surv(time, status), where time is the observed time and status is an indicator that the observed time is the survival time.
trt The argument trt corresponds to the vector of observed treatment statuses. The vector trt can either be a character vector specifying the levels of the treatments (e.g., "Trt" vs. "Ctrl"), a factor vector, or an integer vector (e.g., for binary treatment, 1 or 0 in the i-th position indicates patient i received the treatment or control). For character vectors or integer vectors it is assumed that the first level alphabetically or numerically, respectively, is the reference treatment in the sense that the estimated benefit score will represent the benefit of the second treatment level with respect to the reference level. For example, if trt is a character vector with two treatment options "Trt" and "Ctrl", the estimated benefit score reflects the benefit of "Trt" versus "Ctrl" in the sense that positive estimated benefit scores indicate "Trt" is preferable to "Ctrl". For a factor vector, the first level of the factor will be the reference treatment. Unless specified otherwise, for trt vectors with more than 2 treatment levels, the reference treatment will be chosen in the same way. However, the user may specify which level of the treatment should be the reference via the reference.trt argument. For example, if trt has the levels c("TrtA", "TrtB", "Ctrl"), setting reference.trt = "Ctrl" will ensure that "Ctrl" is the reference level.
propensity.func The argument propensity.func corresponds to a function which returns a propensity score. While it seems cumbersome to have to specify a function instead of a vector of probabilities, it is crucial for later validation that the propensity scores be re-estimated using the resampled or subsampled data (this will be explained further in the section below for the validate.subgroup() function). The user should specify a function which takes two arguments: trt and x, where trt corresponds to the trt argument for the fit.subgroup() function and x corresponds to the x argument for the fit.subgroup() function. The function supplied to the propensity.func argument should contain code that uses x and trt to fit a propensity score model and then return an estimated propensity score for each observation in x. If there are many covariates, the modeler may wish to use variable selection techniques in constructing the propensity score model. In the following code we construct the wrapper function for the propensity score model, which is a logistic regression model with the lasso penalty where the tuning parameter is selected by 10-fold cross-validation using the cv.glmnet() function of the glmnet package (Friedman et al. 2021):

R> propensity.func.lasso <- function(x, trt) {
+   propens.model <- cv.glmnet(y = trt, x = x, family = "binomial")
+   pi.x <- predict(propens.model, s = "lambda.min", newx = x,
+     type = "response")[, 1]
+   pi.x
+ }

For randomized controlled trials with equal probability of assignment to treatment and control, the user can simply define propensity.func as:

R> propensity.func.const <- function(x, trt) 0.5

which always returns the constant 1/2.
For cases with multi-category treatments, the user must specify a propensity function that returns P(T = T_i | X = x_i) for patient i; in other words, it should return the probability of receiving the treatment that was actually received for each patient. For example, such a function can be built on nnet::multinom, whose predict method returns a matrix of probabilities with one column for each treatment level, from which the probability corresponding to the treatment actually observed can be extracted for each patient.
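A sketch of such a multi-category propensity function using nnet::multinom is below; the matrix-indexing step used to pick out each patient's observed-treatment probability is our own construction.

```r
R> propensity.func.multinom <- function(x, trt) {
+   require("nnet")
+   trt <- factor(trt)
+   # multinomial logistic regression for Pr(T = k | X = x)
+   mfit <- multinom(trt ~ x, trace = FALSE)
+   # matrix of probabilities with one column per treatment level
+   propens <- predict(mfit, type = "probs")
+   # for each patient, pick the probability of the treatment actually received
+   propens[cbind(seq_along(trt), as.integer(trt))]
+ }
```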
Note that positive continuous outcomes may be used for "poisson_loss" as well. In general there are fewer restrictions in theory about the types of outcomes used for the above losses, however the imposed restrictions in this package are due to implementation limitations.
loss The loss argument specifies the combination of M function (i.e., loss function) and underlying model f (X). The name of each possible value for loss has two parts: The first part corresponds to the M function and the second part corresponds to f (X) and whether variable selection via the lasso is used. The available M functions are listed in Table 2.
All loss options that have "lasso" in their suffix use f(X) = X^⊤β and have the penalty term λ Σ_{j=1}^{p} |β_j| added to the overall objective function L_W(f) or L_A(f). Adding the penalty term makes the benefit score estimate f̂(X) = X^⊤β̂ sparse in the sense that some elements of β̂ will be exactly zero, allowing a simpler form of the benefit score. An example is "sq_loss_lasso", which corresponds to using M(y, v) = (y − v)^2, a linear form of f, i.e., f(X) = X^⊤β, and an additional penalty term λ Σ_{j=1}^{p} |β_j| added to the loss function for variable selection. All options containing "lasso" in the name use the cv.glmnet() function of the glmnet package (Friedman et al. 2021) for the underlying model fitting and variable selection. K-fold cross-validation is used to select the penalty tuning parameter. Please see the documentation of cv.glmnet() for information about other arguments which can be passed to it.
Any options for loss which end with "lasso_gam" have a two-stage model. Variables are selected using a linear model or GLM in the first stage and then the selected variables are used in a generalized additive model in the second stage. Univariate nonparametric smoother terms are used in the second stage for all continuous variables; binary variables are used as linear terms in the model. All loss options containing "gam" in the name use the gam() function of the R package mgcv (Wood 2017, 2019). Please see the documentation of gam() for information about other arguments which can be passed to it.
All options that end in "gbm" use gradient-boosted decision trees models for f (X). Such machine learning models can provide more flexible forms of estimation by essentially using a sum of decision trees models. However, these "black box" models can be more challenging to interpret. The gbm-based models are fit using the gbm R package (Greenwell, Boehmke, Cunningham, and GBM Developers 2020). Please see the documentation for the gbm() function of the gbm package for more details on the possible arguments. Tuning the values of the hyperparameters shrinkage, n.trees, and interaction.depth is crucial for a successful gradient-boosting model. These arguments can be passed to the fit.subgroup() function. By default, when gbm-based models are used, a plot of the cross-validation error versus the number of trees is displayed. If this plot has values which are still decreasing at the maximum value of the number of trees, then it is recommended to either increase the number of trees (n.trees), the maximum tree depth (interaction.depth), or the step size of the algorithm (shrinkage).
The "owl_hinge" loss options are based on the hinge loss function and thus correspond to support vector machine type optimization procedures. Optimization of hinge-based losses is done via the kernlab package (Karatzoglou, Smola, Hornik, and Zeileis 2004; Karatzoglou, Smola, and Hornik 2019). As such, the underlying model for f(X) depends on the kernel chosen by the user. A linear kernel will yield a linear decision rule f(X), whereas a nonlinear kernel such as the Gaussian radial basis function kernel will yield a more flexible, nonlinear decision rule. Available kernels are listed in the kernlab package and can be displayed by ?kernels.
The outcome-weighted learning based losses with "flip" in their name allow for non-positive outcomes. In these cases they may offer substantial finite sample efficiency gains compared with using the original outcome-weighted learning losses and shifting the response such that it is positive.
method The method argument is used to specify whether the weighting or A-learning method is used. Specify "weighting" for the weighting method that uses L W (f ) and "a_learning" for the A-learning method that uses L A (f ).
match.id This argument allows the user to specify that the analysis dataset is based on matched groups of cases and controls. If used, it should be either a character, factor, or integer vector with length equal to the number of observations in x indicating which patients are in which matched groups. It defaults to NULL, which assumes the samples are not from a matched cohort. Matched case-control groups can be created using any method such as propensity score matching, optimal matching, etc. (Imbens and Rubin 2015). If each case is matched with a control or multiple controls, this argument indicates which case-control pairs or groups go together. If match.id is supplied, then it is unnecessary to specify a function via the propensity.func argument. A quick usage example: if the first patient is a case and the second and third are controls matched to it, and the fourth patient is a case and the fifth through seventh patients are matched with it, then the user should specify match.id = c(1, 1, 1, 2, 2, 2, 2) or match.id = rep(c("Grp1", "Grp2"), c(3, 4)).
augment.func The augment.func argument is used to allow the user to specify an efficiency augmentation function. The basic idea of efficiency augmentation is to construct a model for the main effects of the outcome model and shift the outcome based on these main effects. The resulting estimator based on the shifted outcome can be more efficient than using the outcome itself.
For the same reason that the propensity.func must be specified as a function, the user should specify a wrapper function for augment.func which inputs the covariate information x and the outcome y and outputs a prediction for the outcome for each observation in x. The predictions should be returned on the link scale, in other words on the scale of the linear predictors. The augmentation function may be from a nonlinear or nonparametric model; however, the predictions should still be returned on the link scale. An example of an augmentation function that uses linear regression with a lasso penalty for this model is as follows:

R> augment.func.simple <- function(x, y) {
+   cvmod <- cv.glmnet(y = y, x = x, nfolds = 10)
+   predictions <- predict(cvmod, newx = x, s = "lambda.min")
+   predictions
+ }

A more involved example that models the full conditional outcome E[Y | T, X] and integrates over the treatment levels is given by the following code, which obtains predictions for each observation under both treatment levels and averages these predictions:

R> augment.func <- function(x, y, trt) {
+   data <- data.frame(x, y, trt = ifelse(trt == "Trt", 1, -1))
+   xm <- model.matrix(y ~ trt * x - 1, data = data)
+   cvmod <- cv.glmnet(y = y, x = xm)
+   data$trt <- 1
+   xm1 <- model.matrix(y ~ trt * x - 1, data = data)
+   preds_1 <- predict(cvmod, xm1, s = "lambda.min")
+   data$trt <- -1
+   xm2 <- model.matrix(y ~ trt * x - 1, data = data)
+   preds_n1 <- predict(cvmod, xm2, s = "lambda.min")
+   return(0.5 * (preds_1 + preds_n1))
+ }

For binary outcomes, one must define the augmentation function such that it returns predictions on the link scale as follows:

R> augment.func.bin <- function(x, y) {
+   cvmod <- cv.glmnet(y = y, x = x, family = "binomial")
+   predict(cvmod, newx = x, s = "lambda.min", type = "link")
+ }

Then the defined augmentation function can be used in fit.subgroup() by passing the function to the argument augment.func.
A usage example using the above-defined augmentation function is the following. As before, the first estimate is the treatment main effect, which is always selected, and any other variables selected represent treatment-covariate interactions. As shown in Chen et al. (2017), the optimal efficiency augmentation function may depend on the treatment statuses. Hence the user is also allowed to specify augment.func as a function of trt, i.e., augment.func <- function(x, y, trt).
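Such a call might be sketched as follows; the object name subgrp.model.eff matches the efficiency-augmented model referenced later in the paper, while the other argument values are illustrative.

```r
R> subgrp.model.eff <- fit.subgroup(x = x, y = y, trt = trt,
+   propensity.func = propensity.func.lasso,
+   loss = "sq_loss_lasso",
+   augment.func = augment.func, nfolds = 10)
R> summary(subgrp.model.eff)
```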
fit.custom.loss The fit.custom.loss argument allows the user to provide a function which minimizes a custom loss function for use in the fit.subgroup() function. The loss function, M(y, v), to be minimized must meet the criteria outlined in Section 2.2. The user must provide as fit.custom.loss a function which minimizes a sample-weighted version of the loss function and returns a list with the solution of the minimization in addition to a function which takes covariates as an argument and returns predictions of the benefit score f̂(x) under the estimator resulting from the minimization of the custom loss. If provided, this function should take the modified design matrix and the responses as arguments and optimize a custom weighted loss function. The provided function must be a function with the following arguments:
• x - design matrix.
• y -vector of responses.
• weights - vector of observation weights. The underlying loss function must weight samples according to this vector. See the example below.
• ... -additional arguments passed via .... This can be used so that users can specify more arguments to the underlying fitting function via fit.subgroup() if so desired.
The provided function must return a list with the following elements:
• predict - a function that inputs a design matrix and a type argument for the type of predictions and outputs a vector of predictions on the scale of the linear predictor. Note that the matrix provided to fit.custom.loss has a column corresponding to the treatment main effect prepended to x. Thus, the prediction function should account for this, e.g., predict(model, cbind(1, x)).
• model - a fitted model object returned by the underlying fitting function. This can be an arbitrary R object.
• coefficients - if the underlying fitting function yields a vector of coefficient estimates, they should be provided here.
The provided function can also optionally take the following arguments, which may be used in the custom fitting routine:
• match.id - vector of case/control cluster identifiers. This is useful if cross-validation is used in the underlying fitting function, in which case it is advisable to randomly sample whole clusters instead of individual observations.
• offset - if efficiency augmentation is used, the predictions from the outcome model from augment.func will be provided via the offset argument, which can be used as an offset in the underlying fitting function as a means of incorporating the efficiency augmentation model's predictions.
• trt - vector of treatment statuses.
• family -family of outcome.
An example of fit.custom.loss is a minimization of the exponential loss M(y, v) = y exp(−v) for positive outcomes. In the following code, we first define the weighted loss function with a linear form for the benefit score as the function expo.loss. We then use the optim() function to minimize this function to obtain coefficient estimates and then define pred to be a function which returns predicted benefit scores. The function finally returns a list with the prediction function, the returned optimization object, and the estimated coefficients.
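The fitting function described above can be sketched as follows. The use of optim() with the BFGS method and zero starting values are our own choices; any minimizer returning coefficient estimates would fit the same structure.

```r
R> fit.expo.loss <- function(x, y, weights, ...) {
+   # sample-weighted exponential loss M(y, v) = y * exp(-v)
+   # with a linear benefit score f(x) = x %*% beta
+   expo.loss <- function(beta, x, y, weights) {
+     sum(weights * y * exp(-drop(x %*% beta)))
+   }
+   # minimize the weighted loss over the coefficients
+   opt <- optim(rep(0, NCOL(x)), fn = expo.loss,
+     x = x, y = y, weights = weights, method = "BFGS")
+   coefs <- opt$par
+   pred <- function(x, type = "response") {
+     # benefit scores on the scale of the linear predictor
+     drop(x %*% coefs)
+   }
+   list(predict = pred, model = opt, coefficients = coefs)
+ }
```

The function can then be supplied to fit.subgroup() via the argument fit.custom.loss = fit.expo.loss.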
reference.trt As already mentioned for trt, the user may specify which level of the treatment should be the reference via the reference.trt argument. For example, if trt has the levels c("TrtA", "TrtB", "Ctrl"), setting reference.trt = "Ctrl" will ensure that "Ctrl" is the reference level. This argument is not used for multi-category treatment fitting with OWL-type losses, as the underlying multinomial outcome-weighted model is parameterized such that there is not a reference treatment group. This parameterization is described in Friedman, Hastie, and Tibshirani (2010).
cutpoint The cutpoint is the numeric value of the benefit score f(X) above which patients will be recommended the treatment. In other words, for outcomes where larger values are better and a cutpoint with value c, a patient with covariate values X = x will be recommended the treatment instead of the control if f(x) > c. If lower values of the outcome are better, c will be the value below which patients will be recommended the treatment (i.e., a patient will be recommended the treatment if f(x) < c). By default the cutpoint value is 0. Users may wish to increase this value if there are limited resources for treatment allocation. The cutpoint argument is available for multi-category treatments and is still a single value applied to each comparison with the reference treatment.
The user can also set cutpoint = "median", which will use the median value of the benefit scores as the cutpoint. Similarly, the user can set specific quantile values via "quantx", where "x" is a number between 0 and 100 representing the quantile level; e.g., cutpoint = "quant75" will use the 75th percentile of the benefit scores as the cutpoint value.
retcall The argument retcall is a Boolean variable which indicates whether to return the arguments passed to fit.subgroup(). It must be set to TRUE if the user wishes to later validate the fitted model object from fit.subgroup() using the validate.subgroup() function. This is necessary because when retcall = TRUE, the design matrix x, response y, and treatment vector trt must be re-sampled in either the bootstrap procedure or training and testing resampling procedure of validate.subgroup(). The only time when retcall should be set to FALSE is when the design matrix is too big to be stored in the fitted model object.
... The argument ... is used to pass arguments to the underlying modeling functions. For example, if the lasso is specified in the loss argument, ... is used to pass arguments to the cv.glmnet() function from the glmnet package. If gam is present in the name for the loss argument, the underlying model is fit using the gam() function of mgcv, so arguments to gam() can be passed using .... The only tricky part for gam() is that it also has an argument titled method and hence instead, to change the method argument of gam(), the user can pass values using method.gam which will then be passed as the argument for method in the gam() function. For all loss options with "hinge", this will be passed to both weighted.ksvm() from the personalized package and ipop from the kernlab package.

Binary outcomes
All loss options for continuous outcomes can also be used for binary outcomes. Additionally, the loss argument options that are exclusively available for binary outcomes are:
• "logistic_loss_lasso"
• "logistic_loss_lasso_gam"
• "logistic_loss_gam"
• "logistic_loss_gbm"

R> subgrp.bin <- fit.subgroup(x = x, y = y.binary, trt = trt,
+   propensity.func = propensity.func.lasso,
+   loss = "logistic_loss_lasso", nfolds = 10)

When gradient-boosted decision trees are used for f(X) via the package gbm, care must be taken to choose the hyperparameters effectively. Specifically, shrinkage (similar to the step size in gradient descent), n.trees (the number of trees to fit), and interaction.depth (the maximum depth of each tree) should be tuned according to the data at hand. By default for gradient-boosting models, fit.subgroup() plots the cross-validation error versus the number of trees to enable the user to assess their choice of tuning parameters.

Count outcomes
All loss options for continuous outcomes can also be used for count outcomes. Additionally, the loss argument options that are exclusively available for count outcomes are:
• "poisson_loss_lasso"
• "poisson_loss_lasso_gam"
• "poisson_loss_gam"
• "poisson_loss_gbm"

Time-to-event outcomes
The loss argument options that are available for time-to-event outcomes are:
• "cox_loss_lasso"
• "cox_loss_gbm"

For subgroup identification models for time-to-event outcomes, the user should provide the fit.subgroup() function with a 'Surv' object of the survival package for y. This can be done as follows:

R> library("survival")
R> set.seed(123)
R> subgrp.cox <- fit.subgroup(x = x, y = Surv(y.time.to.event, status),
+   trt = trt, propensity.func = propensity.func.lasso,
+   loss = "cox_loss_lasso", nfolds = 10)

The subgroup treatment effects are estimated using the restricted mean survival time statistic (Irwin 1949; Tsiatis 1997, 1999; Chen and Tsiatis 2001) and can be displayed with the summary or print methods for the 'subgroup_fitted' objects. As before, the first estimate is the treatment main effect, which is always selected, and any other variables selected represent treatment-covariate interactions.

The summarize.subgroups() function for summarizing subgroups
The summarize.subgroups() function provides a quick way of comparing the covariate values between the subgroups recommended the treatment and the control, respectively. p values for the differences of covariate values between subgroups are computed and adjusted for multiple comparisons using the approach of Hommel (1988). For continuous variables the p values come from t tests and for discrete variables the p values come from chi-squared tests. The p values are used as a means to filter out covariates without meaningful differences between subgroups; however, they are not displayed, as they do not represent valid statistical inferences due to their post hoc nature.

R> comp <-summarize.subgroups(subgrp.cox)
The user can optionally print only the covariates which have "significant" differences between subgroups with a multiple comparisons-adjusted p value below a given threshold like the following:
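For example, assuming the print method accepts a p.value threshold argument as described, such a filtered display might be produced by:

```r
R> print(comp, p.value = 0.01)
```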

The validate.subgroup() function for evaluating identified subgroups
It is crucial to evaluate the findings by assessing the improvement in outcomes with the estimated subgroups. Ideally, the treatment should have a positive impact on the outcome within the subgroup of patients who are recommended the treatment and the control should have a positive impact on the outcome within the subgroup of patients who were not recommended the treatment.
In general it is quite challenging to obtain valid estimates of these effects because usually only one dataset is available. Using data twice, or taking the average outcomes by treatment status within each subgroup (using the same data) to estimate the treatment effects, will yield biased and typically overly-optimistic estimates of the subgroup-specific treatment effects. Therefore, as described in Section 2.6, we use resampling-based procedures to alleviate this phenomenon and hope to estimate these effects reliably. The personalized package offers two methods for subgroup treatment effect estimation. Both methods are available via the validate.subgroup() function.

Repeated training/test splitting
The first method for estimating subgroup-specific treatment effects available in validate.subgroup() is prediction-based and requires multiple replications of data partitioning. For each replication in this procedure, the data are randomly partitioned into training and testing portions. The subgroup identification model is then estimated using the training portion, and the subgroup treatment effects are estimated via empirical averages within subgroups using the testing portion. This method requires two arguments to be passed to validate.subgroup(): the first is B, the number of replications, and the second is train.fraction, the proportion of samples used for training. Hence 1 - train.fraction is the proportion of samples used for testing.
The main object which needs to be passed to validate.subgroup() is a fitted object returned by fit.subgroup(). Note that in order to validate a fitted object from fit.subgroup(), the model must be fit with the fit.subgroup() argument retcall set to TRUE because the data passed to fit.subgroup() must be accessed. The validate.subgroup() function uses the same arguments that were passed to the original call of fit.subgroup() for fitting during each replication.
The validation process is carried out by fixing the cutpoint value at the user-specified cutpoint from the call to fit.subgroup(). However, especially in scenarios with very costly treatments, it may be of interest to investigate the treatment effects within subgroups defined by different cutpoints along the range of the benefit score. To simultaneously run the validation procedure for subgroups defined by different cutpoints of the benefit score, the user can specify a vector of benefit score quantiles to validate.subgroup() via the benefit.score.quantiles argument. For example, setting benefit.score.quantiles = c(0.5, 0.75) will yield validation results for subgroups defined by a median cutoff value for the benefit score and a cutoff value at the 75th quantile of the benefit score, the latter of which will result in a smaller subgroup assigned to the treatment that will ideally have a larger treatment effect. The default value for benefit.score.quantiles is the vector c(1/6, 2/6, 3/6, 4/6, 5/6). The results of this can be accessed by setting the argument type = "conditional" for the plot method of 'subgroup_validated' objects or by specifying a vector of indexes via the argument which.quant of the print method of 'subgroup_validated' objects.
R> class(subgrp.model.eff)
[1] "subgroup_fitted"

Here the argument B specifies the number of replications, method indicates which estimation method to use, and benefit.score.quantiles specifies which quantiles of the benefit score to use as cutpoints in addition to 0.
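A call consistent with this description might be sketched as follows; the values of B, train.fraction, and the quantiles are illustrative.

```r
R> validation.eff <- validate.subgroup(subgrp.model.eff, B = 100,
+   method = "training_test_replication",
+   train.fraction = 0.75,
+   benefit.score.quantiles = c(0.5, 0.75))
R> validation.eff
```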

Bootstrap bias correction
The second method for estimation of subgroup-conditional treatment effects described in Section 2.6 and available in validate.subgroup() is the bootstrap bias correction method. The bootstrap bias correction method can be accessed via the validate.subgroup() function as follows:
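A sketch of such a call, with an illustrative number of bootstrap replications, is:

```r
R> validation.boot <- validate.subgroup(subgrp.model.eff, B = 100,
+   method = "boot_bias_correction")
R> validation.boot
```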

Evaluating performance of subgroup-specific treatment effect estimation
We now generate an independent dataset from the same data-generating mechanism of the simulation in order to evaluate how well the subgroup-specific treatment effects are estimated by validate.subgroup().
R> x.test <- matrix(rnorm(10 * n.obs * n.vars, sd = 3), 10 * n.obs, n.vars)

The treatment assignments and outcomes for the test data are generated from the same mechanism as for the training data. We then use the predict() function for objects returned by fit.subgroup() to obtain the estimated benefit scores for the test data:

R> bene.score.test <- predict(subgrp.model.eff, newx = x.test)

Finally, we evaluate the subgroup-specific treatment effects on the test data based on the estimated subgroups and compare these values with confidence intervals from the bootstrap bias correction method. For example, the bootstrap confidence interval for the effect of control among those recommended control is:

    2.5%     97.5%
3.266023 11.345697

We can see that the true values are contained within the bootstrap confidence intervals.

The plot() method and the plotCompare() function
The outcomes (or average outcomes) of patients within different subgroups can be plotted using the plot() function. In particular, this function plots patient outcomes by treatment status within each subgroup of patients. Boxplots of the outcomes can be plotted in addition to densities and interaction plots of the average outcomes within each of these groups. They can all be generated using code like the following, with the resulting plots in Figures 2, 3, and 4:

R> plot(subgrp.model)
R> plot(subgrp.model, type = "density")
R> plot(subgrp.model, type = "interaction")

For objects of class 'subgroup_fitted', specifying the argument type = "conditional" in the plot() function displays smoothed means of the outcomes conditional on each treatment group as a function of the benefit score. Thus, a meaningful subgroup will be revealed if the conditional means of the treated and untreated groups are not parallel in the benefit score. The conditional plot generated from the code below is in Figure 5.
R> plot(subgrp.model, type = "conditional")

Multiple models can be visually compared using the plotCompare() function, which offers the same plotting options as the plot() method for 'subgroup_fitted' objects.
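For example, two fitted models could be compared side by side as follows (a sketch assuming fitted objects subgrp.model and subgrp.model.eff from earlier in the analysis):

```r
# Sketch: visual comparison of two fitted subgroup models; the type
# argument accepts the same options as the plot() method.
library(personalized)

plotCompare(subgrp.model, subgrp.model.eff)
plotCompare(subgrp.model, subgrp.model.eff, type = "density")
```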

Efficiency augmentation
We now run the repeated training and testing splitting procedure on the model for continuous outcomes that did not utilize efficiency augmentation so that we can compare it with the efficiency-augmented model. The results across the iterations of either the bootstrap or the training and testing partitioning procedure can be plotted using the plot() function, similarly to how plot() is used for fitted objects from fit.subgroup(). Boxplots, density plots, and interaction plots are again all available through the type argument. Example code is below, with the resulting plot shown in Figure 6. For the sake of space, we do not show the density plot.
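Such a validation run for the unaugmented model might be sketched as follows (assuming subgrp.model was fit without augmentation; B = 100 and the quantile choices are illustrative):

```r
# Sketch: repeated training/testing validation of the unaugmented model,
# using the same settings as for the efficiency-augmented model so the
# two can be compared.
library(personalized)

validation <- validate.subgroup(subgrp.model,
    B      = 100L,
    method = "training_test_replication",
    benefit.score.quantiles = c(0.5, 0.75))
```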
R> plot(validation)
R> plot(validation, type = "density")

Specifying the argument type = "conditional" plots the validation results conditional on different cutoff values for the benefit score, as specified to validate.subgroup() via the benefit.score.quantiles argument. The resulting conditional plot generated by the code below is shown in Figure 7.

R> plot(validation, type = "conditional")
Multiple validated models can be visually compared using the plotCompare() function, which offers the same plotting options as the plot method for 'subgroup_validated' objects. Here we compare the model fitted using "sq_loss_lasso" to the one fitted using "sq_loss_lasso" and efficiency augmentation. The resulting plot is shown in Figure 8.

R> plotCompare(validation, validation.eff)
From this comparison plot we can see that the efficiency-augmented model provides estimated subgroups that result in better overall outcomes when the recommended treatment is indeed the treatment received.

Example with multi-category treatments
First we simulate data with three treatments. The treatment assignments will be based on covariates and hence mimic an observational setting with no unmeasured confounders. The term delta1 below is the effect of treatment 1 relative to treatment 3 and delta2 is defined similarly for treatment 2. The main effects have nonlinearities.
R> set.seed(123)
R> n.obs <- 1000; n.vars <- 100
R> x <- matrix(rnorm(n.obs * n.vars, sd = 3), n.obs, n.vars)
R> xbetat_1 <- 0.
R> xbeta <- xbeta + delta1 * ((trt.num == 1) - (trt.num == 3)) +
+      delta2 * ((trt.num == 2) - (trt.num == 3))
R> y <- xbeta + rnorm(n.obs, sd = 2)

We will use the factor version of the treatment status vector in our analysis; however, the integer-valued vector, i.e., trt.num, could be used as well. We then construct a propensity score function that takes covariate information and the treatment statuses as input and generates a matrix of probabilities as output. Each row i of the output matrix represents an observation and each column j is the probability that the i-th patient received the j-th treatment. The treatment levels are ordered alphabetically (or numerically if the treatment assignment vector is a vector of integers). Our propensity score model in this example will be a multinomial logistic regression model with a lasso penalty for the probability of treatment assignment conditional on covariate information.

Figure 9: Propensity score overlap plot (histograms of propensity scores by treatment group) for multi-category treatment data.
R> propensity.multinom.lasso <- function(x, trt) {
+      if (!is.factor(trt)) trt <- as.factor(trt)
+      gfit <- cv.glmnet(y = trt, x = x, family = "multinomial")
+      propens <- drop(predict(gfit, newx = x, type = "response",
+                              s = "lambda.min"))
+      probs <- propens[, match(levels(trt), colnames(propens))]
+      probs
+  }

An important assumption for the propensity score is that 0 < P(T_i = t | X) < 1 for all X and t. This assumption, often called the positivity assumption, is impossible to verify. However, in practice the validity of the assumption can be assessed via a visualization of the empirical overlap of our estimated propensity scores to determine if there is any evidence of positivity violations. The check.overlap() function also allows us to visualize the overlap of our propensity scores for multi-category treatment applications. The following code results in the plot shown in Figure 9.

R> check.overlap(x = x, trt = trt, propensity.multinom.lasso)

Each plot in Figure 9 corresponds to a different treatment group; e.g., the plots in the first row are for the subset of patients who received treatment 1. There seems to be no obvious evidence against the positivity assumption.
As the outcome is continuous and there is a large number of covariates available for the construction of a benefit score, we will use the squared error loss with a lasso penalty. The model can be fit in the same manner as in the binary treatment setting; however, only linear models and the weighting method are available. We can also specify the reference treatment (the treatment with which the non-reference treatments are compared by each benefit score); here we specify "Trt_3" as the reference treatment level.
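A sketch of such a fit (assuming the objects x, y, trt, and propensity.multinom.lasso defined above; the loss and method choices follow the text):

```r
# Sketch: fitting the multi-category treatment model with the weighting
# method and a lasso-penalized squared error loss; reference.trt fixes
# the reference treatment level for the benefit scores.
library(personalized)

subgrp.multi <- fit.subgroup(x = x, y = y, trt = trt,
    propensity.func = propensity.multinom.lasso,
    loss            = "sq_loss_lasso",
    method          = "weighting",
    reference.trt   = "Trt_3")

summary(subgrp.multi)
```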
Seven out of 100 variables were selected for delta 1 by the lasso (cross-validation criterion). The summary() function now displays the selected variables for each of the two benefit scores and shows the quantiles of each benefit score. We can also plot the empirical observations within the different subgroups using the plot() function; however, it is now slightly more complicated. It appears that the average outcome is higher for those who received the level of the treatment they were recommended than for those who received a different treatment than they were recommended. Also note that the plot method for 'subgroup_fitted' objects returns a 'ggplot' object (Wickham 2016; Wickham, Chang, Henry, Pedersen, Takahashi, Wilke, Woo, Yutani, and Dunnington 2021) and can thus be modified by the user. The example below yields Figure 10.
R> pl <- plot(subgrp.multi)
R> pl + theme(axis.text.x = element_text(angle = 90, hjust = 1))

To obtain valid estimates of the subgroup-specific treatment effects, we perform the repeated training and testing resampling procedure using the validate.subgroup() function. The results are shown in Figure 11 (validation results for multi-category treatment data).
f_Trt_3(x): 0
maxval = max(f_Trt_1(x), f_Trt_2(x))
which.max(maxval) = the treatment level which maximizes maxval
Trt recom = which.max(maxval) * I(maxval > c) + Trt_3 * I(maxval <= c),
    where c is the cutpoint

Setting the sample.pct argument above to TRUE prints the average percentage of all patients in each subgroup (as opposed to the average sample sizes). We can see that about 58% of patients were recommended treatment 1, and among those recommended treatment 1, we expect them to have larger outcomes if they actually receive treatment 1 as opposed to the other treatments. The estimated effects are positive within all three subgroups (meaning those recommended each of the different treatments have a positive benefit from receiving the treatment they are recommended as opposed to receiving any other treatment).
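The recommendation rule printed above can be written out directly in base R; this is an illustrative re-implementation, not code from the package (the function name recommend.trt is ours):

```r
# Sketch: the multi-category recommendation rule. Each row of bene.mat
# holds the two benefit scores f_Trt_1(x) and f_Trt_2(x); the reference
# treatment Trt_3 has benefit score 0 by construction.
recommend.trt <- function(bene.mat, cutpoint = 0) {
  maxval <- apply(bene.mat, 1, max)
  best   <- colnames(bene.mat)[apply(bene.mat, 1, which.max)]
  # recommend the maximizing treatment only if its score exceeds the
  # cutpoint; otherwise fall back to the reference treatment
  ifelse(maxval > cutpoint, best, "Trt_3")
}

bene.mat <- cbind(Trt_1 = c( 1.2, -0.4, 0.1),
                  Trt_2 = c( 0.3, -0.2, 0.6))
recommend.trt(bene.mat)
# [1] "Trt_1" "Trt_3" "Trt_2"
```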

Numerical comparisons
In this section we evaluate the finite-sample performance of many of the methods available in the personalized package, along with comparative methods available in other packages, via a set of numerical studies. These comparative methods, outside the scope of the personalized package, are the residual-weighted learning (RWL) method of Zhou, Mayer-Hamblett, Khan, and Kosorok (2017) as implemented in the DynTxRegime package; the 1-PLS approach of Qian and Murphy (2011) (outcome-lasso); an outcome-modeling approach based on Bayesian additive regression trees (outcome-BART); and the model-based trees and forests approach of Seibold, Zeileis, and Hothorn (2016, 2017) implemented in the model4you package (Seibold, Zeileis, and Hothorn 2021). Comparison with more packages and methods would be ideal; however, the majority of available packages either do not provide functions for prediction for new patients or are too computationally demanding. The methods from the personalized package utilized in this comparison are the A-learning method with the square loss and a lasso penalty (Sq-A); the weighting method with the square loss and a lasso penalty (Sq-W); the weighting method with the flipped outcome weighted learning logistic loss and a lasso penalty, i.e., M(y, v) = |y| log(1 + exp{−sign(y)v}) (FOWL-L-W); and the weighting method with the flipped outcome weighted learning hinge loss, i.e., M(y, v) = |y| max(0, 1 − sign(y)v) (FOWL-H-W). We additionally compare with loss-augmented versions of all of these methods; the corresponding names have "-Aug" appended, e.g., "Sq-W-Aug".
Covariates were generated as $X = (X_1, \ldots, X_p)^\top$, where $X_2, X_4, X_6, \ldots, X_{40}$ are binary random variables with probability of success 0.25, and the remaining $p - 20$ elements of $X$ are generated from a multivariate normal random variable with variance-covariance matrix having 1 on the diagonal and $\rho^{|i-j|}$ for the element in the $i$-th row and $j$-th column, with $\rho = 0.75$. Treatment statuses are generated from an observational-study-type setting with $P(T = 1 \mid X = x) = \mathrm{expit}(\beta_{T0} + \beta_T^\top x)$, where $\mathrm{expit}(x) = 1/(1 + e^{-x})$, the first 10 elements of $\beta_T$ are generated from a uniform random variable on $[-0.5, -0.25] \cup [0.25, 0.5]$, and the rest are 0. The intercept $\beta_{T0}$ is set such that on average 1/3 of observations receive $T = 1$. The responses are generated from the following two models:

Model 1: $Y = \gamma^\top X + T \beta^\top X + \varepsilon$

Model 2: $Y = \exp(0.5\gamma^\top X) - \exp(0.5\{\nu_1 X_1 X_2 + \nu_2 X_1 X_3 + \nu_3 X_2 X_3 + \nu_4 X_3 X_4 + \nu_5 X_5 X_6\}) + T \beta^\top X + \varepsilon$.

For all methods that apply variable selection, the lasso is used, and 10-fold cross-validation based on mean-squared error is used for selection of the tuning parameter. For all methods that require a propensity score, the propensity score is estimated by fitting a binary logistic regression model with a lasso penalty. For all modeling options in the personalized package that do not use the hinge loss, the lasso is used for variable selection. For all methods in the personalized package that use outcome augmentation, a linear model Y ~ x + x:trt with the lasso is used to create the augmentation part. The treatment-covariate interactions are included in this function so that the main effects can be correctly specified. However, as the goal in subgroup identification is to estimate treatment-covariate interactions, the augmentation function we use averages over the predictions for trt = 1 and trt = -1. We note that, under Model 2, this augmentation function is misspecified.
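A simplified sketch of this covariate and treatment generation (with an illustrative sample size and intercept; the paper's exact constants are not reproduced here, and overwriting columns of a jointly generated matrix only approximates the stated covariance structure):

```r
# Sketch: covariates with AR(1)-correlated continuous components and
# Bernoulli(0.25) components at indices 2, 4, ..., 40, plus an
# observational logistic treatment assignment.
set.seed(1)
n <- 250; p <- 50; rho <- 0.75

# AR(1)-type covariance and correlated normal draws
Sigma  <- rho ^ abs(outer(1:p, 1:p, "-"))
x      <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)

# overwrite the even columns up to 40 with binary covariates
bin.idx <- seq(2, 40, by = 2)
x[, bin.idx] <- matrix(rbinom(n * length(bin.idx), 1, 0.25),
                       n, length(bin.idx))

# treatment assignment: first 10 coefficients drawn from
# U([-0.5, -0.25] U [0.25, 0.5]), the rest zero; intercept illustrative
expit  <- function(v) 1 / (1 + exp(-v))
beta.T <- c(runif(10, 0.25, 0.5) * sample(c(-1, 1), 10, TRUE),
            rep(0, p - 10))
trt    <- rbinom(n, 1, expit(-0.7 + drop(x %*% beta.T)))
```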
For the BART approach, we use the BayesTree package (Chipman and McCulloch 2016) with all covariates and the treatment indicator included and estimate the benefit score ∆(x) for each patient by evaluating the difference in predictions for trt = 1 versus trt = -1. We use the default settings in the BayesTree package as the default settings are well-known to perform admirably (Chipman, George, McCulloch et al. 2010). Similarly, with the 1-PLS approach, we fit a linear model with a lasso penalty for the outcome, including covariate main effects and treatment-covariate interactions. The benefit score is estimated in the same way as the BART approach. The augmentation function needed in creating the residuals for the RWL method is constructed in the same way as that used for the outcome augmentation of the personalized package.
The methods are evaluated by investigating the rank correlation between the true benefit score $\Delta(x)$ and its estimate $f(x)$ (or a monotone transformation of an estimate of $\Delta(x)$) on independent test sets of size 10,000. Methods are also evaluated by their area under the receiver operating characteristic curve (AUC) with respect to the true underlying subgroups. The results are displayed in Figures 12 and 13, where "ME size: large" indicates the main effects are large, i.e., $c = 4/3$, and "ME size: small" indicates $c = 2/3$. The vertical dashed line separates methods from the personalized package from the other methods. We did not include results for the outcome weighted learning losses, only their flipped versions, as the flipped versions were uniformly better. Similarly, the tree-based version of the approach in model4you is not included, since the forest version is uniformly better. These results are, however, available in the supplementary material. Under Model 1, the 1-PLS method is correctly specified and thus can serve as the gold standard in terms of performance. Under Model 2, however, the main effects are nonlinear and the 1-PLS model is incorrectly specified. Under Model 2, when the main effect size is small, the interaction effects dominate the main effects in size, and outcome-modeling approaches such as the 1-PLS method are more robust to model misspecification than when the main effect sizes are large.
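The two evaluation metrics can be illustrated on toy data; here the AUC is computed via the Mann-Whitney rank-sum identity rather than a package function, and delta.true and score.est are simulated stand-ins:

```r
# Sketch: rank correlation and AUC between a true benefit score and a
# noisy estimate of it.
set.seed(7)
n <- 1000
delta.true <- rnorm(n)                         # true benefit scores
score.est  <- delta.true + rnorm(n, sd = 0.5)  # a noisy estimate

# rank correlation between truth and estimate
rank.cor <- cor(delta.true, score.est, method = "spearman")

# AUC with respect to true subgroup membership I(delta.true > 0),
# via the rank-sum (Mann-Whitney) identity
grp <- delta.true > 0
r   <- rank(score.est)
auc <- (sum(r[grp]) - sum(grp) * (sum(grp) + 1) / 2) /
       (sum(grp) * sum(!grp))

c(rank.cor = rank.cor, auc = auc)
```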
We can see that augmentation with correct specification (Model 1) in the personalized package can boost performance, although not necessarily so with incorrect specification (Model 2). Understandably, scenarios with larger main effects are associated with worse performance for all methods. Of the two tree-based ensemble approaches, model4you-Forest tends to perform the best under most scenarios. Under Model 2 and large main effect sizes, the flipped outcome weighted learning approach with the hinge loss and no loss augmentation works very well and has the lowest variance. The flipped outcome weighted learning approach with loss augmentation and the logistic loss is never the best in any setting; however, it is close to the best in all scenarios and is thus a reasonable choice in data scenarios where not much is known about the problem a priori.

Analysis of National Supported Work study
In this section we conduct a subgroup identification analysis for a study of the effectiveness of a training program designed to help under-served and under-employed workers gain the requisite skills for employment. The data came from the National Supported Work Study (LaLonde 1986). The outcome of interest is whether or not the earnings of individuals are greater in 1978 than in 1975 before the training program. The first estimate is the treatment main effect, which is always selected. Any other variables selected represent treatment-covariate interactions.
              Trt  hispYes  marrYes
Estimate   0.1334  -0.3367   0.1825

To evaluate the impact of the estimated subgroups, we randomly split the data into a training portion (80%) and a testing portion (the remaining 20%), fit a benefit score model on the training portion, and evaluate the treatment effects within the estimated subgroups on the testing portion, with 500 replications. This allows us to determine whether using our benefit score to make treatment decisions for patients will result in better outcomes. The results over the 500 training and testing replications are displayed in Figure 14.
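This training/testing validation corresponds to a call along the following lines (a sketch; the name of the fitted object, subgrp_w here, is an assumption, while the 80% split and 500 replications follow the text):

```r
# Sketch: 80/20 repeated training/testing validation with 500
# replications for the NSW analysis.
library(personalized)

val_subgrp_w <- validate.subgroup(subgrp_w,
    B              = 500L,
    method         = "training_test_replication",
    train.fraction = 0.8)
val_subgrp_w
```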

R> plot(val_subgrp_w)
The plotCompare() function allows us to inspect the treatment effects within subgroups on the testing datasets across all repetitions of the validation procedure in comparison with the estimated subgroup treatment effects using the training data. This comparison, along with an additional comparison with the estimates generated from the bootstrap bias correction approach, is displayed in Figure 15. The estimates of the subgroup treatment effects based on the training data are likely overly optimistic; however, the training/testing procedure allows us to correct for possible overfitting. We can see that among those who are recommended to receive the employment training, those who actually received the employment training were more likely to have higher earnings than those who did not receive the training. On the other hand, among those who were not recommended the training, those who did not receive the training may have slightly higher earnings, although there seems to be much more variation. Indeed, the estimate of the benefit of employment training is attenuated for the validation-based estimates compared with the biased empirical estimates. Across the replications, approximately 91% of the samples were recommended to receive employment training.