Inference of subgroup-level treatment effects via generic causal tree in observational studies

A tree-based algorithm for subgroup identification offers high interpretability and allows valid inference for the resulting tree estimators.


Introduction
Correlation analysis used to be the primary way to discover relationships between variables; it measures increasing or decreasing trends quantified by a correlation coefficient [1]. However, looking only at correlation may lead to misleading conclusions. For instance, it is known that circulating blood vitamin D levels correlate with a decreased risk of colorectal cancer. Yet claiming that a high dose of vitamin D intake could reduce the risk of colorectal cancer is implausible, because the relationship between them can be an illusion created by the influence of other factors [2]. Correlation is limited in answering whether A causes B without confounding, which is of greater interest in today's research; that is what causation addresses.
Exploring causality has drawn unprecedented attention spanning many fields, such as public policy, econometrics, and medicine. For practical applicability, the scenarios for conducting causal inference are increasingly shifting from randomized experiments to observational studies: tailored randomized experiments, in which all covariates other than the "cause" must be kept similar between treated and control groups, are usually time-consuming and expensive, whereas observed data are easily accessible and economical. Further, with soaring data sizes and increasing diversity of samples, treatment will inevitably lead to different treatment effects across individuals, the so-called heterogeneity. Heterogeneity in treatment effects essentially arises from interactions between treatment and covariates, producing various magnitudes of treatment effects in the population. Specifically, a treatment may have a negative impact overall but a positive influence on specific subpopulations. Thus, learning heterogeneous treatment effects (HTE) is particularly instructive in modern data analysis.
Owing to the strong capabilities of machine learning (ML) in data mining, several meritorious ML algorithms have sprung up to estimate HTE over the last few years. A brief introduction to related ML methodologies is given in Section 2. Among the literature, the most popular ML methods generally give pointwise treatment effect estimators, i.e., estimating the treatment effect at the individual level, like generalized random forest [3] or meta-learners [4]. We argue that individual treatment effects (ITEs) are too granular to utilize in large-scale data analysis, where researchers are more concerned about subpopulations with distinct characteristics and treatment effects. Meanwhile, ITEs cannot provide an intuitive explanation of the heterogeneity mechanism for decision-makers. Still, there is no unified operation to derive subgroup-level treatment effects (STEs) from individual treatment effects, especially since the statistical properties of the derived STEs may be difficult to characterize.

Related works
Much literature uses ML methods to estimate treatment effects at the individual level. The concept of constructing a forest for causal inference has been prevalent since the work of Wager and Athey [7], in which they aggregated regression or classification trees into a forest to allow for pointwise treatment effect estimation. Afterward, Athey et al. [3] proposed a forest built by an adaptive-weighting method based on random forests, called Generalized Random Forest (GRF); the target parameter is obtained by solving a weight-specific moment condition, where heterogeneity is embedded in the sample weights. Drawing inspiration from the idea of nearest neighbors, a debiased 1-nearest-neighbor algorithm was proposed by Fan et al. [8]; they use a weighted sum of two L-statistic estimators with different sample sizes to eliminate the leading order of bias. Künzel et al. [4] put forward a nonparametric framework called X-learner to avoid overfitting caused by unbalanced sizes of treatment groups. Besides, several works [9-12] trained neural networks to fit counterfactual functions in a pre-learned representation space, where the distributions of covariates are adjusted to be similar in both treatment groups.
However, the preceding discussion has clarified the limits of pointwise estimation in discovering the heterogeneity mechanism; several tree-based works instead allow subgroup discovery with reasonable interpretation. Su et al. [13] first advocated building trees for subgroup analysis in randomized trials. Yang et al. [14] generalized the interaction tree of Su et al. [13] to observational data by replacing the treatment effect estimator with three adjusted estimators: the inverse probability weighting estimator, the G-formula estimator, and the doubly robust estimator. Foster et al. [15] applied random forests as a tool to predict treatment effects and fed the predictions as the outcome into classification and regression trees [16] to detect the group most affected by treatment. Athey and Imbens [5] first proposed a sound tree-based method that enables valid statistical inference for average treatment effects in subspaces; the confidence intervals can be obtained through linear regression after the tree model is constructed.
The estimation method in this paper is inspired by the literature inferring STEs through linear models. Due to the challenge of inferring ITEs directly, Chernozhukov et al. [17] proposed a generic ML framework to instead infer key features of ITEs, which include subgroup average treatment effects. They post-processed rough estimates of ITEs to obtain STEs through a subgroup-specific linear model, where the subgroups are designated according to some rule, such as the quantiles of the ITE estimates. Still, this can lead to unreasonable groupings when the true subgroups are uneven or the ITEs differ significantly. Park and Kang [18] proposed a sample-splitting linear model to infer STEs, where subgroups are pre-specified by the investigators' scientific grouping hypothesis or by a clustering algorithm, K-means. The former grouping method relies on prior knowledge, which may not be reliable in practical problems involving many features: misspecifying effect modifiers or setting wrong splitting points will produce varied results. For the latter data-driven approach, K-means cannot automatically highlight the vital role of effect modifiers when measuring the similarity of observations over the whole covariate space.
In general, defining a scientific grouping strategy while ensuring favorable statistical properties of the STE estimators is our main purpose. The ideal situation is to perform subgroup identification with a data-driven approach, without additional prior expertise. Through tree-based methods, this goal can be reached readily once trees are adapted to statistical inference for STEs. Once the subgroups are determined, constructing confidence intervals reduces to conditional inference; several papers give related references for the application, see Section 4 for details. Our approach makes new contributions to subgroup identification through ML methods, automatically forming subgroups and giving treatment effect estimates with valid inference. Tree-based methods have been questioned on estimation accuracy, since they are generally designed for subgroup analysis and use an average treatment effect to approximate the individual treatment effects of the units belonging to a subgroup; this may lose accuracy compared with some modern methods for ITE estimation, but it provides more intuitive insights into heterogeneous mechanisms. Thus, one of our main focuses is to improve the tree's estimation performance. In this paper, we utilize the Robinson-style idea of partialling out [6] to give better estimates, provably enhancing the performance of trees.

Identification of treatment effect on subgroups
Suppose the observational data contain N i.i.d. samples indexed by i = 1, …, N. For each sample, we observe (X_i, A_i, Y_i), where X_i denotes the p-dimensional covariates, A_i is a binary indicator with A_i = 1 meaning receiving treatment and A_i = 0 meaning not receiving it, and Y_i is the outcome. Let Y_i(1) and Y_i(0) be the potential outcomes [19] under the treatment assignment A_i = 1 or 0, respectively.
Several assumptions are required to guarantee the identifiability of subgroup treatment effects in the observational study: a) Consistency assumption: Y_i = Y_i(a) if A_i = a; the potential outcome is equal to the observed outcome if the actual treatment takes value a. b) Unconfoundedness assumption: (Y(1), Y(0)) ⫫ A | X; namely, conditional on the covariates X, the potential outcomes are independent of treatment assignment. c) Positivity assumption: 0 < P(A = a | X = x) < 1 for all values x of X and treatments a. This assumption stipulates that each sample has a non-zero probability of receiving either treatment.
Under these assumptions, the expected value of the potential outcome can be written as a conditional mean of the observed outcome, E[Y(a) | X] = E[Y | X, A = a], which allows us to identify treatment effects using observed data (see Refs. [20, 21] for more details about identifiability conditions).
At the same time, we can derive a useful equation by taking the expectation of the consistency assumption: E[Y | X] = E[Y(0) | X] + e(X)τ(X), (1) where we denote τ(X) = E[Y(1) − Y(0) | X] as the treatment effect at X, and e(X) = P(A = 1 | X) is the propensity score denoting the probability of receiving treatment.
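Written out in standard notation (a reconstruction from the surrounding definitions; τ denotes the conditional treatment effect and e the propensity score), the identity follows from consistency:

```latex
% Consistency: Y = A\,Y(1) + (1-A)\,Y(0). Taking conditional expectations:
\begin{aligned}
E[Y \mid X] &= e(X)\,E[Y(1)\mid X] + \bigl(1 - e(X)\bigr)\,E[Y(0)\mid X] \\
            &= E[Y(0)\mid X] + e(X)\,\tau(X),
\end{aligned}
\qquad \tau(X) = E[Y(1)-Y(0)\mid X],\quad e(X) = P(A=1\mid X).
```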

Estimator of subgroup-level treatment effect
We first construct suitable STE estimators that adapt to tree splitting. The observed outcome consists of two main effects on two paths: one is the baseline effect from X to Y, and the other is the treatment effect from A to Y, so we have Y = E[Y(0) | X] + A τ(X) + ε. Furthermore, the baseline effect E[Y(0) | X] can be replaced by E[Y | X] − e(X)τ(X) according to Eq. (1).
During tree growth, the observations falling into the same node can be viewed as a subgroup. Given a split s, a left node l and a right node r are formed; the model for the observed outcome on the formed subgroups can then be built as
Y_i − E[Y | X_i] = (A_i − e(X_i)) θ_{C(i)} + ε_i, (2)
where we use the average treatment effect of a child node to approximate the treatment effects of the units falling into it, e.g., τ(X_i) ≈ θ_{C(i)}; θ = (θ_l, θ_r) is the vector of treatment effects on the left and right nodes, and C(i) denotes the child node that X_i falls in. Model (2) is constructed in the Robinson style [6]; several pieces of literature [18, 22, 23] estimate treatment effects based on the same idea of the Robinson transformation, which removes the effect of the covariates from both treatment and outcome. From the perspective of the causal graph, this model blocks the backdoor path from treatment to outcome by controlling for the covariates X.
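As a concrete illustration, the within-node estimator implied by model (2) is the least-squares slope of the outcome residual on the treatment residual. The following sketch (function name and array layout are our own, not the paper's R code) computes it:

```python
import numpy as np

def node_effect(y, a, m_hat, e_hat, idx):
    """Robinson-style treatment effect for the units `idx` in one node.

    y, a   : outcome and binary treatment arrays
    m_hat  : estimates of E[Y | X] for each unit
    e_hat  : estimates of the propensity score e(X) for each unit
    The effect is the least-squares slope of the outcome residual
    on the treatment residual, computed within the node.
    """
    ry = y[idx] - m_hat[idx]          # outcome residual Y - m(X)
    ra = a[idx] - e_hat[idx]          # treatment residual A - e(X)
    return np.sum(ra * ry) / np.sum(ra ** 2)
```

With exact nuisances and a constant effect inside the node, the slope recovers that effect exactly.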

For the nuisance functions in model (2), the conditional mean of the outcome E(Y | X) and the propensity score e(X), we use their leave-one-fold estimators as plug-ins. We randomly divide the index set of the whole sample into K folds with equal sample sizes, I_1, …, I_K. The nuisance functions are fitted on the sample indexed by the complement of I_k, denoted I_k^c, and E(Y | X) and e(X) are estimated on I_k, k = 1, …, K. Iterating through k, we obtain the nuisance estimators for the entire sample. Given the estimated nuisances, the treatment effect of the subgroups formed by split s can be estimated by plugging them into model (2) (Eq. (3)). To derive the properties of this estimator, we place some restrictions on the estimated nuisance functions. Here we denote the L_q(P) norm of a random variable Y as ‖Y‖_{P,q} = (E_P|Y|^q)^{1/q}, where P is the probability distribution of Y.
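The leave-one-fold (cross-fitting) scheme can be sketched with scikit-learn as below; the choice of gradient boosting for both nuisances is illustrative, and the function name is ours:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.model_selection import KFold

def crossfit_nuisances(X, a, y, n_folds=5, seed=0):
    """Leave-one-fold (cross-fitted) estimates of E[Y|X] and e(X).

    Each unit's nuisance values are predicted by models trained on
    the complement of its fold, matching the plug-in scheme above.
    """
    m_hat = np.empty(len(y))
    e_hat = np.empty(len(y))
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for train, test in kf.split(X):
        m_hat[test] = GradientBoostingRegressor().fit(X[train], y[train]).predict(X[test])
        e_hat[test] = GradientBoostingClassifier().fit(X[train], a[train]).predict_proba(X[test])[:, 1]
    return m_hat, e_hat
```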

For each k, the nuisance estimators are required to obey bounded-moment and convergence-rate conditions: Assumption 3.1 gives the bounded moment conditions of model (2); Assumption 3.2 restricts the convergence rate of the nuisance estimators, and such rates are attainable for many modern ML methods [22]. A simple way is to fit the nuisance functions on the whole training data before tree building: once we obtain the nuisance estimates for each sample, storing them and passing them into the tree-building process for subsequent use reduces the frequency of model fitting and takes full advantage of the entire training data.
Given a split s, suppose that Assumptions 3.1 and 3.2 hold. The treatment effect estimator θ̂ defined by Eq. (3) is asymptotically normal, √n (θ̂ − θ) → N(0, Σ), where Σ is a covariance matrix, and the asymptotic variances of the treatment effects θ_l and θ_r of the child nodes can be estimated by the plug-in estimator of Eq. (5) built from the residuals of model (2).
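For the variance, one standard plug-in is the heteroskedasticity-robust (sandwich) variance of the no-intercept least-squares slope; we offer it as a plausible stand-in for the paper's Eq. (5), not a verbatim transcription:

```python
import numpy as np

def node_effect_var(y, a, m_hat, e_hat, idx):
    """Sandwich variance for the Robinson-style node effect.

    Standard robust variance of a single-regressor, no-intercept
    least-squares slope: sum((ra * resid)^2) / (sum(ra^2))^2.
    """
    ry = y[idx] - m_hat[idx]          # outcome residual
    ra = a[idx] - e_hat[idx]          # treatment residual
    theta = np.sum(ra * ry) / np.sum(ra ** 2)
    resid = ry - ra * theta           # model residuals within the node
    return np.sum((ra * resid) ** 2) / np.sum(ra ** 2) ** 2
```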

Generic causal tree algorithm
In this section, we introduce the procedure of tree generation based on the STE estimators defined above. We preserve the core frames of interaction trees [13, 14], including partitioning, pruning, and selection, while using a different splitting criterion and a different estimation idea. Since the semi-parametric causal effect model in Section 3.2 is suitable for general scenarios with confounding interference, allows multiple machine learning algorithms to fit the nuisance functions, and makes the framework of this tree architecture generic, we call it a generic causal tree (GCT).

Tree splitting
The central problem is to seek the split that immediately maximizes the heterogeneity between child nodes. For the disjoint subgroups in the left and right branches, we ask whether their treatment effects differ significantly. Analogous to a two-sample hypothesis test, we construct a splitting statistic
G(s) = (θ̂_l − θ̂_r)² / (V̂ar(θ̂_l) + V̂ar(θ̂_r)), (6)
where θ̂_l and θ̂_r are the estimated treatment effects of the left and right child nodes formed by a given split s, and V̂ar(θ̂_l), V̂ar(θ̂_r) are their estimated variances; the quantities in the splitting statistic (6) are computed by Eqs. (3) and (5). The split statistic imposes two demands on tree splits: i) maximize the heterogeneity of treatment effects between the two subgroups; ii) minimize the uncertainty within each subgroup. We calculate the split statistic at all candidate splits, and the split with the maximum statistic is conducted. It should be noted that some splits encountered during the split-searching procedure violate the above model settings and assumptions; we simply regard the estimators in Eq. (6) as rough approximations at such times.
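A minimal split search over one covariate, using the residualized effect estimate and the two-sample-style statistic, might look as follows (a self-contained sketch; the names and the min_size default are our assumptions):

```python
import numpy as np

def robinson_effect_and_var(ry, ra):
    """Node effect and robust variance from residualized outcome
    ry = Y - m(X) and residualized treatment ra = A - e(X)."""
    theta = np.sum(ra * ry) / np.sum(ra ** 2)
    resid = ry - ra * theta
    var = np.sum((ra * resid) ** 2) / np.sum(ra ** 2) ** 2
    return theta, var

def best_split(x_col, ry, ra, min_size=20):
    """Scan candidate thresholds on one covariate and return the split
    maximizing the two-sample-style heterogeneity statistic."""
    best = (None, -np.inf)
    for c in np.unique(x_col)[1:]:
        left, right = x_col < c, x_col >= c
        if left.sum() < min_size or right.sum() < min_size:
            continue                         # enforce the node-size limit
        tl, vl = robinson_effect_and_var(ry[left], ra[left])
        tr, vr = robinson_effect_and_var(ry[right], ra[right])
        stat = (tl - tr) ** 2 / (vl + vr)
        if stat > best[1]:
            best = (c, stat)
    return best  # (threshold, statistic)
```

On data whose effect jumps at a known threshold, the scan should recover a split near that point.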
We repeatedly execute the splitting procedure on newly generated nodes until a pre-specified terminal condition is met, for example, reaching a minimum number of observations in each node. Here, both the number of treated units and the number of control units should be constrained to avoid poor estimation of the treatment effect in a node.

We adopt the pruning idea of CART [16] and generate a sequence of subtrees by gradually cutting the weakest branches of the initial tree. The goodness of a given tree is the accumulation of the splitting statistics at all of its splits, reflecting the heterogeneity it carries: G(T) = Σ_{h ∈ internal nodes of T} G(h).

Following the work of Su et al. [13], the heterogeneity-complexity for a given tree T can be defined as G_λ(T) = G(T) − λ|T̃|, (8) where T̃ represents the set of internal nodes of T, |T̃| is the number of those nodes, and λ is a penalty parameter.
Let T_h denote the subtree rooted at an internal node h, comprising all descendants of h. Subtree T_h is cut only when the heterogeneity-complexity improves after pruning, leaving the single node h. During pruning, we iterate through each internal node h of a given tree and calculate the corresponding threshold λ_h = G(T_h)/|T̃_h|. The branch with minimal λ_h, denoted λ*, is cut off. That means once the degree of penalty grows exactly up to λ*, the weakest split is truncated while the other splits remain.
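The weakest-link computation can be sketched on a toy node structure (the dict-based tree representation is our own, not from the paper's code):

```python
def branch_stats(node):
    """Total heterogeneity G and number of internal nodes of the branch
    rooted at `node` (a dict with 'stat', 'left', 'right'; leaves have
    no children). Returns (G, n_internal)."""
    if node.get("left") is None:          # leaf
        return 0.0, 0
    gl, nl = branch_stats(node["left"])
    gr, nr = branch_stats(node["right"])
    return node["stat"] + gl + gr, 1 + nl + nr

def weakest_link(root):
    """Internal node whose branch gives the smallest lambda = G / n_internal;
    collapsing it loses the least heterogeneity per split removed."""
    best = (None, float("inf"))
    stack = [root]
    while stack:
        n = stack.pop()
        if n.get("left") is None:
            continue
        g, k = branch_stats(n)
        if g / k < best[1]:
            best = (n, g / k)
        stack.extend([n["left"], n["right"]])
    return best  # (node_to_collapse, lambda*)
```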
Continuing the pruning procedure until the initial tree is pruned down to its root node, we get a sequence of subtrees. Given a penalty parameter λ①, we select the optimal tree by evaluating the heterogeneity-complexity of each subtree on validation data.
① Ordinarily, we set λ to the 1 − α quantile of the asymptotic distribution of Eq. (6) when there is no significant difference of treatment effects in the two child nodes.
The procedures of building the GCT are shown in Algorithms 1 and 2.

Honest estimation for subgroup treatment effect
One of our central concerns is constructing confidence intervals for the tree-based treatment effect estimators. Direct inference is challenging since the derived subgroups may change with different tree models. Some literature [24, 25] suggested solving this issue with conditional inference, conditioning on the selected model, since the uncertainty about the targets of inference is induced by the randomness of the selected model. Also, several papers [5, 17, 18] employed honest estimation, a data-splitting approach that usually appears in the context of conditional inference, allowing valid statistical inference given a model that is built independently of the data on which inference is based [26]. Combined with the theory of the semiparametric framework [22], this gives clear conditional distributions for our tree estimators and sheds light on valid inference.
The key point of honest estimation is employing independent data for different tasks. Recall that in the regular version of GCT, the same dataset is used to decide splits and estimate treatment effects, as described in Section 3.3. In the honest version of GCT, we divide the whole sample into two equal-sized parts: half of the sample is used to generate the tree model, following the same procedure as Algorithm 1; the other half, called the estimation data, is put into the given tree to estimate treatment effects and does not participate in tree building.
Let 𝒳 denote the covariate space. Given a specific partition Π decided by the tree-building data, the covariate space is divided into disjoint subspaces 𝒳_1, …, 𝒳_M. The vector of treatment effects on the formed subgroups is then estimated with the estimation data as if the subgroups were pre-specified. For the observed outcome, we construct the model Y_i − E[Y | X_i] = (A_i − e(X_i)) θ_{g(X_i)} + ε_i, where g(X_i) is a subgroup indicator denoting which subgroup X_i belongs to under the map Π.
The nuisance functions can be replaced by their leave-one-fold estimators computed on the estimation data S_es; the treatment effects θ are then estimated as in Eq. (3), with the partition applied to the estimation data.
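An honest estimation sketch follows, assuming the fitted tree's partition is summarized by an `assign` function mapping a covariate row to a subgroup label (names and signature are our own):

```python
import numpy as np

def honest_subgroup_effects(x, y, a, m_hat, e_hat, assign, seed=0):
    """Honest estimation sketch: one half of the sample is reserved for
    building the tree elsewhere; here only the *other* half estimates
    each subgroup's effect with the Robinson-style slope.
    `assign` maps a covariate row to its subgroup label."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    est = idx[len(y) // 2:]              # estimation half only
    labels = np.array([assign(x[i]) for i in est])
    out = {}
    for g in np.unique(labels):
        sub = est[labels == g]
        ra = a[sub] - e_hat[sub]
        ry = y[sub] - m_hat[sub]
        out[g] = np.sum(ra * ry) / np.sum(ra ** 2)
    return out
```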
Assumption 4.1 (Nonempty partition). Let Π denote the partition given by the final tree model. The covariate space is divided into disjoint subspaces 𝒳_1, …, 𝒳_M. We assume that there exists a constant c > 0 such that the given tree satisfies P(X ∈ 𝒳_m) ≥ c for every subgroup; that is, given a specified partition Π, no subgroup is asymptotically empty.

Algorithm 1 (Generic causal tree for subgroup identification) and Algorithm 2 (Initial tree build) give the pseudocode of the procedure. Note: the function LeaveOneFold gives leave-one-fold estimators of the nuisances; CalculateComplexity gives the heterogeneity-complexity by (8).

Note that the properties of tree estimators are generally derived under the simplification of fixed partitions; the following theorem is conditional on a built tree model.

Theorem 4.1. Given a specified partition Π, suppose that Assumptions 3.1, 3.2, and 4.1 hold. The estimators of the subgroup-level treatment effects are asymptotically normal with a diagonal covariance matrix; for each subgroup, the variance can be consistently estimated by the corresponding plug-in estimator.

Simulation
In this part, we conduct two simulation studies to test the performance of our GCT algorithm. The first study compares GCT with two benchmark methods, the causal tree [5] and the causal interaction tree [14], concentrating on their subgroup identification and estimation abilities. In the second study, we test the effectiveness of the GCT algorithm for statistical inference. We retain an experimental setup and performance metrics similar to those of Ref. [14].

Simulation setup
For the data generating process in the following experiments, we draw N i.i.d. samples of (X, A, Y). The 5-dimensional covariate vector X is generated from a multivariate normal distribution N(0, Σ), where Σ is a covariance matrix with 1 for the diagonal elements and 0.3 for the off-diagonal elements. We set all variables as confounders that affect both the treatment A and the outcome Y. The propensity score e(X) is designed as a function of the covariates, and the treatment A for each individual is generated from a Bernoulli distribution with probability e(X). We consider the following two cases:

(i) Homogeneous setting with the same treatment effect for all units; the treatment effect is equal to 2 on the whole population.

(ii) Heterogeneous setting with treatment effects varying with the value of the variable X^(4); the treatment effect is equal to 5 for units on one side of a threshold of X^(4) and equal to 2 otherwise.
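The exact propensity and outcome functions could not be recovered from the extracted text, so the following data-generating sketch uses an illustrative logistic propensity and a threshold of 0 on X^(4); both are assumptions, not the paper's exact design:

```python
import numpy as np

def simulate(n, heterogeneous=True, seed=0):
    """Illustrative data-generating sketch: 5 correlated normal covariates,
    a logistic propensity (assumed form), and a treatment effect of 2
    everywhere or 5 vs. 2 split on X4 (threshold 0 is an assumption)."""
    rng = np.random.default_rng(seed)
    cov = np.full((5, 5), 0.3) + 0.7 * np.eye(5)    # 1 on diagonal, 0.3 off
    x = rng.multivariate_normal(np.zeros(5), cov, size=n)
    e = 1.0 / (1.0 + np.exp(-x.sum(axis=1) / 5))    # illustrative propensity
    a = (rng.random(n) < e).astype(float)
    tau = np.where(x[:, 3] > 0, 5.0, 2.0) if heterogeneous else np.full(n, 2.0)
    y = x @ np.ones(5) + a * tau + rng.normal(size=n)   # confounded baseline
    return x, a, y, tau
```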

Performance metrics
We evaluate these tree-based methods on independent test data. The following metrics will be reported. The probability of correct trees (corr.tree): trees that split on the correct split variables the correct number of times are considered correct trees.
The number of leaf nodes (num.leaf):The correct number of leaf nodes of a tree should be 1 in the homogeneous case and 2 in the heterogeneous case.
The number of noise splits (num.noise): the number of times the tree splits on noise split variables. In the homogeneous case, all covariates are noise split variables; in the heterogeneous case, {X^(1), X^(2), X^(3), X^(5)} are the noise split variables.
Accuracy of the first split (fir.splt.acc): the proportion of final trees that make the correct first split; applicable only to the heterogeneous setting. The first split variable generally contributes the most to heterogeneity.
Pairwise prediction similarity (PPS) [14]: measures the accuracy of the subgroup partition by comparing, over all pairs of observations (i, j), whether i and j fall in the same subgroup under the true model and under the established tree.
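A natural formalization of PPS counts the fraction of pairs on which the two partitions agree (check Ref. [14] for the exact definition used in the paper):

```python
from itertools import combinations

def pps(true_groups, tree_groups):
    """Pairwise prediction similarity: the fraction of pairs (i, j) on
    which the true partition and the fitted tree agree about whether
    i and j share a subgroup."""
    agree, total = 0, 0
    for i, j in combinations(range(len(true_groups)), 2):
        same_true = true_groups[i] == true_groups[j]
        same_tree = tree_groups[i] == tree_groups[j]
        agree += (same_true == same_tree)
        total += 1
    return agree / total
```

Relabeling the subgroups does not change the score, which is why pairwise agreement is a convenient partition metric.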

Mean squared error (MSE): MSE = (1/n_test) Σ_i (τ_i − τ̂_i)², where τ_i is the true treatment effect and τ̂_i is the estimated treatment effect in the test data.

Implementation details
Code for the simulations is available at https://github.com/Caiwei-Zhang/GenericCausalTree. We implement our GCT method in R, building on the user-written split-function mechanism and retaining the main frame of the code by Yang et al. [14]. Two options are considered in our simulations to fit the nuisance functions in GCT and CIT. Besides fitting the nuisance functions with the whole dataset before tree building, as described earlier, the other way is to fit the nuisances with the samples in each parent node during tree splitting. The latter enables capturing local data characteristics but may also increase the risk of overfitting and slow down tree generation. We compare the two model-fitting patterns and then select the better one for the data analysis in Section 6. The penalty parameter is chosen to be the 95% quantile of the null distribution of the splitting statistic (6). We set 80% of the data as training data for tree splitting and the remaining 20% as validation data for tree selection. For honest GCT in the second study, we first take out half of the data for estimation and use the remaining half for tree building.
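The 95% quantile used as the penalty can be computed directly; reading the signed version of statistic (6) as asymptotically standard normal under no difference (so its square is chi-square with 1 degree of freedom) is our interpretation of the footnoted choice:

```python
from statistics import NormalDist

# The 95% chi-square(1) quantile equals the squared 97.5% standard
# normal quantile, computable with the standard library alone.
z = NormalDist().inv_cdf(0.975)
penalty = z ** 2
print(round(penalty, 3))  # 3.841
```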

We implement the Causal Tree with the R package causalTree②. The tree splitting and validation criteria need to be specified before running the tree. Referring to Yang et al. [14], two versions of the Causal Tree are considered: the original version and a tuned version. The original Causal Tree (Org.CT) takes the "honest causal tree" criterion as its splitting and validation rule. For the tuned version (Opt.CT), several alternative splitting and validation rules are provided in the package causalTree, and we select the optimal combination of rules with the best average performance by grid tuning③. In addition, we apply propensity score weighting in the estimation during the tree-building process to accommodate observational data, as Athey and Imbens suggested [5].
For the ML algorithms used to fit the outcome models and propensity score models, we try a gradient boosting machine (GBM) and a logistic regression model, respectively. All the ML algorithms are implemented with the same parameter settings across methods. We note that the conclusions about the performance of these methods are largely independent of the particular ML algorithm; many other powerful ML algorithms, such as random forests, could be chosen to fit the nuisance functions.

Comparison with benchmark methods
We summarize the results and provide general remarks on the strengths and drawbacks of the CT, CIT, and GCT algorithms in our simulations.All three algorithms are implemented following Section 5.3.Table 1 reports the average performance metrics in both heterogeneous and homogeneous settings among 1000 repetitions.
Overall, GCT has excellent subgroup identification and estimation performance, followed by CIT-DR and Opt.CT. GCT slightly outperforms CIT in the heterogeneous setting, and both Opt.CT and CIT perform well in the homogeneous setting. In either setting, the estimation error of GCT is the smallest. The performance of CIT under the two model-fitting patterns is similar in this experimental setting, but fitting nuisance functions in parent nodes seems to slightly enhance the performance of GCT. We note that in a more complicated heterogeneous setting, fitting models in parent nodes during splitting could be a wise choice: since the response functions differ across distinct parts of the covariate space, fitting models in subspaces can greatly aid the detection of local structure.
In terms of tree splitting in the heterogeneous setting, GCT builds the most correct tree models during the 1000 repetitions under both model-fitting patterns. Meanwhile, since the first split variable generally contributes the most to heterogeneity, attaining the highest accuracy of the first split shows that GCT has a strong ability to detect heterogeneity. GCT also shows good power in subgroup identification, with the best PPS.

We notice that the average number of leaf nodes of CIT-DR is less than 2 and its number of noise splits is close to 0, indicating that the splitting of CIT is relatively conservative in the heterogeneous case. This may be attributable to its method of treatment effect estimation. The treatment effect estimation of the CIT-DR algorithm is based on propensity score weighted counterfactual prediction. This method pools the treatment A and the covariates X together to fit the counterfactual functions E(Y | X, A), where the treatment is placed in the same position as the other covariates, which may weaken the effect of treatment. When fitting this outcome model, the response surfaces corresponding to A = 1 and A = 0 interfere, blurring the difference between them and negatively affecting the counterfactual predictions. Since the two response surfaces differ more in heterogeneous settings than in homogeneous settings, we find that CIT's performance in the homogeneous setting is always better than in the heterogeneous setting in the following simulations. At the same time, the pooled data increase the difficulty of fitting the outcome model E(Y | X, A), which may place higher demands on the ML algorithm used. The results in Supporting Information S.2.1 also demonstrate this point: CIT is relatively sensitive to different ML algorithms. In contrast, for GCT, we do not directly conduct predictions involving A when fitting the outcome models; we fit E(Y | X), the effect of X on Y, and then remove this part of the effect from Y by partialling out. This alleviates the interference caused by obvious differences between response surfaces, and that may be why GCT outperforms CIT in this heterogeneous setting.
Besides, the performance of Opt.CT is far better than that of Org.CT, indicating that the Causal Tree is extremely dependent on parameter tuning. The Org.CT does poorly in both settings since it tends to generate very large trees. As we noted, the causal tree criterion guides tree splitting according to the expected mean squared error (EMSE), and it is prone to introduce a split even when the two branches have the same treatment effect, as long as the variances of the outcome in the treated and control groups are reduced. Splitting on noise variables that only have an impact on the outcome but not on the treatment will also lower the EMSE. In terms of estimation, the MSE shown in Table 1 and the boxplot in Fig. 1 highlight the promise of the GCT algorithm for accurate estimation of treatment effects. GCT yields the lowest MSE among these methods in both settings.

Table 1. The summary of average performance metrics during 1000 repetitions for three tree-based algorithms. The upper panel "Hetero" corresponds to the results in the heterogeneous setting; the lower panel "Homo" corresponds to the results in the homogeneous setting. The optimal metrics are marked in bold text.

② github.com/susanathey/causalTree
③ According to the tuning result, the best combination of parameters for Opt.CT is split.Rule = "tstats", split.Honest = "TRUE", cv.option = "matching" in the heterogeneous setting, and split.Rule = "tstats", split.Honest = "FALSE", cv.option = "matching" in the homogeneous setting.
To investigate the performance of these methods as the sample size grows, we gradually increase the sample size from 1000 to 20000 and report the trend of the MSE in Fig. 2. The MSEs of GCT and Opt.CT decline steadily as the sample size goes up. In contrast, the MSE of the CIT algorithms does not converge under the given sample sizes, and their splitting performance does not improve with increased sample size. The complete performance metrics are presented in Supporting Information S.2.2 for reasons of space, together with a brief analysis of tree splitting and estimation.
Further, we conduct additional experiments to evaluate these methods more comprehensively in different scenarios. (Ⅰ) We mentioned above that changing the ML algorithm used to fit the outcome models does not influence the ranking of these tree-based methods. To illustrate this, we fit the outcome models of GCT and CIT by random forest (RF) and a generalized linear model (GLM) besides GBM. The performance metrics are summarized in Supporting Information S.2.1. They demonstrate that the choice of ML algorithm affects the performance of the trees but hardly changes their ranking: GCT performs better regardless of the ML method. They also indicate that CIT is relatively sensitive to the ML method, with larger fluctuations in subgroup identification performance. (Ⅱ) The results show that all three methods are negatively impacted by unobserved confounding. In terms of tree splitting, the performance of CIT drops the most, followed by GCT, and Opt.CT drops the least; however, GCT still performs best. For more details, see the results in Supporting Information S.2.3.

The inference by honest GCT
For the second study, we test the effectiveness of honest GCT in terms of inference. We report the coverage rate of 95% confidence intervals on each subgroup in the correct tree models. We further show the simultaneous coverage rate of 95% simultaneous confidence intervals over both subgroups, following the approach of Hothorn et al. [27], which allows simultaneous inference procedures in semiparametric models, widening the confidence intervals by controlling the type I error of each null hypothesis below the nominal level. As we can obtain confidence intervals for all the nodes in each tree built by GCT, we ask whether the confidence interval at each node of a tree covers its true average treatment effect, even if the tree is incorrectly built. To this end, we report the coverage proportion (Cov.Prop), the proportion of confidence intervals covering the true average treatment effects of the nodes in a tree.
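A conservative way to widen intervals for simultaneous coverage is the Bonferroni adjustment sketched below; the max-t approach of Hothorn et al. used in the paper is tighter, so this is only a simple stand-in:

```python
from statistics import NormalDist

def simultaneous_cis(thetas, ses, alpha=0.05):
    """Conservative simultaneous confidence intervals via Bonferroni:
    each of the M subgroup intervals uses level alpha / M, so the family
    covers all true effects with probability >= 1 - alpha."""
    m = len(thetas)
    z = NormalDist().inv_cdf(1 - alpha / (2 * m))
    return [(t - z * s, t + z * s) for t, s in zip(thetas, ses)]
```

Each adjusted interval is wider than the unadjusted 95% interval, which is the price paid for family-wise coverage.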

The results are shown in Table 2, and we notice two points. First, when the sample size N is small, non-honest GCT performs better than honest GCT. Focusing on the heterogeneous setting, the coverage rates of the single confidence interval and the simultaneous confidence interval by non-honest GCT are closer to the nominal coverage rate than those of honest GCT, indicating that halving the sample indeed sacrifices the accuracy of both partition and estimation for honest GCT. But as N increases, the performance discrepancies in subgroup identification and estimation between the two versions of GCT narrow; for instance, the proportion of correct trees and the MSE of honest GCT become almost equal to those of non-honest GCT. In general, both versions of GCT give valid conditional coverage. Meanwhile, the average percentage of confidence intervals covering the true average treatment effects of the tree nodes is almost 90%. Thus, we can be more confident in inferring the average treatment effect of each node in a tree built by GCT.

N = 2000
In Supporting Information S.2.4, we further implement honest GCT and honest CIT with sample size N = 2000 to compare their inference performance. The results show that although CIT performs well in splitting and estimation as before, its inference lacks validity: the coverage rate of its 95% confidence intervals is far below the nominal level. Even in the homogeneous case, CIT's coverage rate is not ideal.

Data analysis
To illustrate the feasibility of the GCT algorithm proposed in the preceding sections, we conduct subgroup identification in an observational study that evaluates the effect of race on access to opioid use disorder (OUD) treatment in the United States. The data come from the Treatment Episode Data Set: Admissions 2015 (TEDS-A-2015), a program that collected information on admissions to substance abuse treatment in 2015 and is available on the website of the Substance Abuse & Mental Health Data Archive (SAMHDA)④. We only use the data from Maryland in the following analysis. The original Maryland dataset contains information on 107509 participants with 62 variables. After removing redundant variables and filtering out invalid levels of some variables, the final dataset contains 45396 observations with 26 variables. The processed data provide demographic characteristics such as age, gender, race, marital status, education, pregnancy at the time of admission, and veteran status, as well as socio-economic characteristics such as employment status, living arrangement, source of income, health insurance, and the primary source of payment. Substance abuse characteristics, such as whether the patient receives medication-assisted opioid therapy, the number of prior treatment episodes, the primary substance abuse problem, and the usual route of administration, are also included. Related studies [28,29] mainly discuss the disparities between African Americans and Whites. Here we set the treatment variable as an indicator of being African American or White, with one level denoting Whites (29901 participants) and the other denoting African Americans (15495 participants). The outcome variable is the number of waiting days to receive OUD treatment. We truncate extremely large outcome values at the 99th percentile. A detailed description of all variables is provided in Supporting Information S.3.1. We apply the honest GCT algorithm to the processed dataset. GBM is used to fit both the propensity score and the conditional mean of the outcome before tree building. In the implementation of honest GCT, the ratio of training data to estimation data is 1:1; among the training data, we extract 20 percent as validation data to perform tree selection.
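The honest data partition just described can be sketched as follows; this is a minimal illustration under the proportions stated in the text, and `honest_split` is a hypothetical helper, not the paper's code.

```python
import random

def honest_split(indices, seed=0):
    # Split the sample 1:1 into a tree-building half and an
    # effect-estimation half; hold out 20% of the tree-building half
    # as validation data for tree selection (proportions from the text).
    rng = random.Random(seed)
    idx = list(indices)
    rng.shuffle(idx)
    half = len(idx) // 2
    train, estimate = idx[:half], idx[half:]
    n_val = len(train) // 5          # 20% of the training half
    validate, grow = train[:n_val], train[n_val:]
    return grow, validate, estimate

grow, validate, estimate = honest_split(range(1000))
print(len(grow), len(validate), len(estimate))  # 400 100 500
```

The key property is that the units used to choose the splits never appear in the estimation half, which is what makes the downstream inference "honest".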

Analysis
The optimal tree by honest GCT is shown in Fig. 3. The final tree consists of four terminal nodes, splitting on whether the patient received medication-assisted opioid therapy (MAT) (methuse), the primary source of payment (primpay), and the number of prior treatment episodes (noprior), respectively. That is, whether a patient received MAT is the most important variable for racial disparities. Personal financial status and previous substance abuse history are also critical factors affecting access to OUD treatment. Contrary to common belief, the treatment effect estimate in the whole population (root node) is 0.042, indicating that the average waiting time for Whites is slightly longer than that for African Americans overall. But the situation differs across subgroups. After the first split, based on whether a patient received MAT, the waiting time for Whites is 0.18 days shorter than for African Americans among those who received MAT (72 percent of the whole population). In the other branch, those who did not receive MAT (28 percent of the population), African Americans have a priority of 0.61 days over Whites in receiving OUD treatment. The second split is on the primary source of payment.
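As a quick sanity check on the reported numbers (assuming the sign convention that positive values mean longer waits for Whites), the root-node effect should equal the proportion-weighted average of the two child-node effects after the first split:

```python
# Weighted average of the child-node effects after the first split
# (signs assumed: -0.18 for the MAT branch, +0.61 for the no-MAT branch).
root = 0.72 * (-0.18) + 0.28 * 0.61
print(round(root, 3))  # 0.041, matching the reported 0.042 up to rounding
```

This consistency is expected because the root estimate averages over the two subpopulations in proportion to their sizes.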
Based on the first split, when a patient's primary source of payment falls into self-pay, health insurance companies, Medicare, Medicaid, or other government payment, a group comprising 70 percent of the population, Whites still have a slight priority of 0.078 days in access to OUD treatment. Most strikingly, among those who did not receive MAT and whose primary source of payment is "other", the racial disparity in OUD treatment reaches its maximum of 3.3 days; Whites have an overwhelming advantage in access to OUD treatment in this subgroup.
The tree splits on the number of prior treatment episodes at the third level. Given MAT status and the primary source of payment, if participants had no prior treatment episodes (covering 20 percent of the whole population), the waiting time for Whites is 0.29 days longer than for African Americans. For the other 50 percent of participants, who have had at least one prior treatment episode, Whites have a priority of 0.21 days over African Americans. Although it is commonly believed that African Americans are generally at a disadvantage relative to Whites in access to OUD treatment, the results reveal that African Americans have priority in receiving OUD treatment among almost half of the Maryland participants. The strongest racial disadvantage for African Americans appears only in approximately 2 percent of the population, those who have received MAT and pay for this treatment episode by other primary sources.
Looking at the most affected subgroup in Fig. 4, Whites also show a delay in receiving treatment compared with other subgroups. Yet African Americans are more affected, so the waiting time still presents a substantial racial discrepancy. In this subgroup, the distribution of waiting days for African Americans is dispersed: nearly half of African Americans wait more than two days to enter OUD treatment, while such long waits are rare in the other subgroups. Therefore, more attention should be paid to patients who have received MAT and use other sources of payment, especially African Americans.
To further compare with the literature studying the effect of racial disparities on access to OUD treatment, we extract the baseline of valid trees (BVT), the deepest tree among those with positive heterogeneity-complexities. The BVT is presented in Supporting Information S.3.2. It contains eleven leaf nodes with different treatment effects. Besides the first three splits shown in Fig. 3, the BVT also splits on the service setting at admission, the frequency of use of the primary substance, education, the number of substances, the source of income, and pregnancy. Several of these variables are considered among the most influential for racial disparities [29], demonstrating the rationale for GCT in practical applications.

Conclusions
We believe that causal inference at the subgroup level will be more intelligible for decision-making in large-scale data analysis. For subgroup identification, we gather units with similar characteristics and treatment effects and approximate their individual treatment effects by an average value. We improve on existing tree-based methods with GCT, which identifies subgroups with significant heterogeneity and allows valid inference on the subgroup treatment effects. As discussed previously, the way treatment effects are estimated at the nodes contributes substantially to how the trees behave. A semiparametric framework is embedded into GCT for treatment effect estimation and allows us to provide explicit asymptotic properties of the STE estimators produced by the trees. Combined with honest estimation, we can further conduct valid conditional inference for the tree estimators in observational studies, which has not been achieved before.
We conduct several experiments to compare GCT with two benchmark tree-based methods. The GCT algorithm performs better in heterogeneous settings than the other two, showing good power in detecting heterogeneity. Another strength of GCT lies in its estimation accuracy; the idea of partialling out better handles the interference of confounding with estimation. Moreover, the asymptotic properties of the tree estimators provide theoretical guarantees when using GCT to conduct inference. In the simulations, GCT shows good convergence in subgroup identification and estimation and reaches the nominal coverage rate, which coincides with our theorems for the tree estimators.
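The partialling-out idea credited here with reducing confounding bias can be sketched in a few lines; this is a Robinson-style residual-on-residual illustration with a hypothetical data-generating process, not the paper's estimator.

```python
import random

random.seed(2)
n = 5000
# hypothetical DGP: x confounds both treatment d and outcome y;
# the true treatment effect is 1.0
x = [random.gauss(0, 1) for _ in range(n)]
d = [0.5 * xi + random.gauss(0, 1) for xi in x]
y = [1.0 * di + 2.0 * xi + random.gauss(0, 1) for di, xi in zip(d, x)]

def ols_slope(u, v):
    # simple least-squares slope of v on u
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / \
        sum((a - mu) ** 2 for a in u)

# naive regression of y on d is biased by the confounder x
naive = ols_slope(d, y)

# partial out x from both y and d, then regress residual on residual
by, bd = ols_slope(x, y), ols_slope(x, d)
ry = [yi - by * xi for yi, xi in zip(y, x)]
rd = [di - bd * xi for di, xi in zip(d, x)]
theta = ols_slope(rd, ry)

print(round(naive, 2), round(theta, 2))  # naive is biased upward; theta near 1.0
```

Under this DGP the naive slope concentrates around 1.8, while the residual-on-residual slope recovers the true effect of 1.0, which is the intuition behind fitting nuisance models for the outcome and propensity before estimating node-level effects.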
Depending on the scenarios and goals of data analysis, subgroup identification has continuing application prospects. It may make practical sense to adapt the model to more complex data types, such as extremely imbalanced data. It would also be valuable to further develop the procedures and theory for statistical inference on tree-based treatment effect estimators.

Fig. 1.
Fig. 1. The distribution of MSE over 1000 simulated data sets in the heterogeneous (left) and homogeneous (right) settings.

Fig. 2.
Fig. 2. The MSE curves of the three tree-based algorithms with sample sizes increasing from 1000 to 20000. The curves of CIT-DR-fitBe and CIT-DR-fitIn almost overlap in the homogeneous setting.

Fig. 3.
Fig. 3. The final tree model for TEDS-A (Maryland, 2015). The square brackets under each node give the confidence interval for that subgroup.
size of treated units and control units in a node. The function DoEstimation estimates the treatment effect and variance estimator at a node by Eqs. (3) and (5); the splitting statistic is calculated at each candidate split by SplitStatistic through Eq. (6).

Table 2.
The summary of the performance metrics for honest GCT and non-honest GCT. Conditional on correct trees, we report the coverage rate of their confidence intervals. In the heterogeneous setting, Cov1 and Cov2 denote the proportion of times that the confidence intervals cover the true treatment effect on each subgroup over 1000 repetitions, and Covsim reports the proportion of times that the confidence intervals simultaneously cover the true treatment effects on both subgroups over 1000 repetitions. In the homogeneous setting, Cov denotes the proportion of times that the confidence intervals cover the true treatment effect at the root node. The abbreviations for other metrics are described in Table 1.
JUSTC, 2023, 53(11): 1102. DOI: 10.52396/JUSTC-2022-0054