LATENT CLASS MIXTURE MODELS OF TREATMENT EFFECT HETEROGENEITY FOR POST HOC SUBGROUP ANALYSIS

In randomized experiments, it is often of interest to characterize treatment effect heterogeneity in terms of baseline covariates. Usually, the aim is to identify subpopulations likely to have particularly positive or negative (or neutral) responses to treatment. The process of searching for such subpopulations after the completion of an experiment (without pre-specifying which subpopulations will be considered as candidates) is called ’post hoc subgroup analysis’. It is a controversial practice. Concerns about data dredging and multiple comparisons (Rothwell, 2005) have led many authors to advise against reporting results from post hoc subgroup analyses at all. However, it is our view that post hoc analyses can produce informative insights that would be unlikely to arise from limited pre-registered comparisons. Here we illustrate an approach in which identification of special subgroups is one byproduct of fully modeling treatment effect heterogeneity more generally. By placing shrinkage priors on relevant parameters and applying cross validation based tools for model evaluation and comparison, we minimize data dredging concerns and provide a mechanism for gauging confidence in the substantive implications of model results.


Introduction
(This paper is a work in progress.) In randomized experiments, it is often of interest to characterize treatment effect heterogeneity in terms of baseline covariates.Usually, the aim is to identify subpopulations likely to have particularly positive or negative (or neutral) responses to treatment.The process of searching for such subpopulations after the completion of an experiment (without pre-specifying which subpopulations will be considered as candidates) is called 'post hoc subgroup analysis'.It is a controversial practice.Concerns about data dredging and multiple comparisons (Rothwell, 2005) have led many authors to advise against reporting results from post hoc subgroup analyses at all.However, it is our view that post hoc analyses can produce informative insights that would be unlikely to arise from limited pre-registered comparisons.
Here we illustrate an approach in which identification of special subgroups is one byproduct of fully modeling treatment effect heterogeneity more generally.By placing shrinkage priors on relevant parameters and applying cross validation based tools for model evaluation and comparison, we minimize data dredging concerns and provide a mechanism for gauging confidence in the substantive implications of model results.
Employing parametric probability models of heterogeneity brings certain automatic advantages.The parameter estimates have interpretable implications about the shape of heterogeneity, and models provide estimates of uncertainty about those parameters.Models also provide predictions of treatment effect for future subjects and corresponding uncertainty estimates.And Bayesian models in particular allow us to place hierarchical shrinkage priors on relevant parameters to avoid overfitting/data dredging (Gelman et al., 2012).These are obvious features of parametric probability models and are only worth mentioning because many methods for subgroup analysis are nonparametric or not model based and do not share these features.
Of course, any model is bound to be misspecified, and the advantages described above only apply insofar as the chosen model is a good approximation to reality.We choose a relatively flexible class of models to improve the chances that some model in the class approximates reality well.We employ Leave One Out Cross Validation (LOO-CV) techniques for model comparison and evaluation (Gelfand, 1996;Vehtari and Lumpenin, 2001).For both prediction and inference we prefer models with better LOO-CV estimated expected utility, where a good choice for utility in this context is posterior predictive density.We estimate uncertainty about the expected utility of each candidate model and adjust our degree of belief in each model's implications accordingly.If multiple models could plausibly have the best true expected utility, we do not draw firm conclusions about aspects of heterogeneity on which those models disagree.If one model is clearly superior to the rest and appears to fit adequately based on posterior predictive checks (Gelman, Blei), then we would be fairly confident in its implications.
Specifically, we propose to model treatment effect heterogeneity using regularized Bayesian latent class mixture models with treatment interaction terms and flexible error distributions.In such models, the treatment interaction terms capture 'continuous heterogeneity' while the latent class components capture 'discrete heterogeneity.'By continuous heterogeneity we mean variation in subjects' individual treatment effects that is well approximated by a smooth function of underlying covariates.[[[Come up with better term than continuous.It is confusing that the covariates, and therefore the response surface of estimated treatment effect, may have discrete jumps under a continuous heterogeneity regime.]]]Discrete heterogeneity refers to variation in subjects' individual treatment effects that is associated with latent class membership, where latent class membership may in turn be associated with baseline covariates.Discrete heterogeneity is likely to be present if a treatment works through unobserved causal pathways that may be discretely open or closed.For example, suppose a drug works better in people with a specific phenotype for some protein receptor, but the presence of that phenotype is not recorded as a baseline covariate in a clinical trial evaluating the drug.Further, suppose that a recorded baseline covariate (say, weight) is associated with the presence of the beneficial phenotype.Then treatment effect variation as a function of weight will be better approximated by a latent class model with weight as a predictor of latent class membership than by any smooth function of weight alone.It can sometimes be important to understand which type of heterogeneity is present.
Despite our approach being a fairly straightforward application of Bayesian latent class mixture models and existing model comparison and evaluation techniques, we have not seen it in the subgroup analysis literature.Further, our approach offers a different combination of strengths (and weaknesses) from those methods we have seen.
One general tactic in the literature is to use nonparametric machine learning algorithms (often based on trees) to predict counterfactual outcomes of future subjects under treatment and control.Examples of works in this vein include (Kang et al, 2012;Foster et al., 2010;Su et al, 2009;and others).These methods likely have the edge over ours when it comes to predictive accuracy and flexibility.Some of them also use cross validation to effectively mitigate multiple comparisons concerns.However, they do not provide estimates of uncertainty about their predictions and usually do not characterize the shape of heterogeneity interpretably.
A closely related line of work directly learns optimal treatment assignment rules without first estimating counterfactual outcome response surfaces (Qian and Murphy, 2011;Zhang et al, 2012;Zhao et al, 2012).These methods are not interested in learning about heterogeneity, just assigning the best treatment to each subject.They have similar strengths and weaknesses relative to our method as the machine learning approaches described above.(Imai and Ratkovic, 2012) employ a linear SVM model with interaction terms to model heterogeneity.The output of their model is interpretable, and they place shrinkage penalties on the parameters to discourage overfitting.They use a cross validation measure for model selection.One could replace their SVM with a regression probability model and obtain uncertainty estimates as well.However, they do not directly model discrete heterogeneity and do not consider uncertainty in their model selection criterion.
There have been other examples of latent class mixture models in the literature.In another context, (Sobel and Muthen, 2012) used a logistic-normal latent class mixture model to reflect the assumption that there exists a subpopulation in which the treatment has zero effect.(Shen and He, to appear) recently applied a similar model to identify subgroups and developed a corresponding likelihood ratio test for the existence of latent treatment effect classes.Neither of these approaches allows for continuous effect modification, however, and both are very sensitive to the assumption of a normal error distribution.We use very flexible error distributions so that our estimates are robust to departures from normality.Working in a model evaluation and comparison framework as opposed to Shen's and He's hypothesis testing framework allows us to consider more complex models leading to better fits and more reliable results.
The structure of this paper is as follows.In section 2, we describe our approach in detail, illustrating various features with simulations and by reanalyzing a clinical trial for an HIV treatment that was used as an example in (Shen and He, to appear).Then, having warmed up on a simulated dataset and an obsolete dataset, in section 3 we apply our approach to data of real policy significance from the Oregon Health Insurance Experiment (OHIE).The OHIE was a natural experiment that arose when Oregon instituted a lottery to determine who could enroll in a new Medicaid program with limited openings.This experiment allowed researchers to explore various public health and economic effects of Medicaid.One prominent finding was that Medicaid increased emergency department (ED) utilization contrary to many experts' predictions.The researchers performed multiple preregistered comparisons between subgroups and discovered several heterogeneities.Because the OHIE study contains lots of noncompliers (i.e.lottery winners who did not enroll in Medicaid and lottery losers who managed to enroll through other channels), we follow the researchers in performing an instrumental variable analysis using the principal stratification framework of (Rubin, ?).Our method extends naturally to this framework because principal strata are themselves latent classes.In section 4 we conclude.
Before proceeding, we prominently note a major limitation.Latent class mixture models are not identifiable for categorical outcome distributions such as the Bernoulli (Titterington, 1985).They are identifiable for the Poisson, negative binomial, or almost any continuous outcome distribution, however (Titterington, 1985).

Notation and Model
Specification.We use the potential outcomes framework (Rubin, 1974;Neyman, 1923) which formalizes the notion that each experimental unit (e.g.patient) has a potential outcome for each possible treatment assignment that unit might have received.We get to observe only the potential outcome corresponding to the treatment actually received.We treat the counterfactual potential outcome that would have been observed had the treatment assignment been different as an unobserved random variable.For the i th unit, let Z i ∈ {0, 1} denote the treatment assignment and Y z,i the potential outcome corresponding to the possibly counterfactual treatment assignment Z i = z.Then Y z=1,i − Y z=0,i is the effect of treatment on unit i.Note that this individual level causal effect can never be observed because we never observe both potential outcomes.Still, if we specify a model for the observed data it is possible to estimate parameters that govern treatment effect heterogeneity as a function of covariates.
In Figure 1, we graphically specify our model for the case of full compliance with treatment.[[[[[Distinguish between latent and observed variables in graph.]]]]]The dashed lines in the graph indicate deterministic relationships, and the solid lines indicate stochastic relationships.An interpretation of the nodes is as follows.The potential outcomes for the i th unit under treatment assignment Z = z for z in {0, 1} are denoted Y z,i .Again, for each patient we only observe one of these and the other is treated as missing.The expected potential outcome for a unit with covariates X i and latent class G i under treatment assignment Z = z is denoted by µ z,i .That is, the µ z,i are the marginal expectations of the potential outcomes.φ z are parameters governing the marginal distribution of potential outcomes Y z,i other than their means µ z,i .For example, in a normal model, the φ z may be standard deviations.ρ governs the covariance between the two potential outcomes, but it is completely unidentified because we never observe both potential outcomes for any one unit.µ z=0 is a function of X i , G i , and parameters β C .µ z=1,i = µ z=0,i + ∆ i , where ∆ i denotes the average treatment effect for units with covariates X i and latent class G i .∆ i is a function of X i , G i , and parameters β ∆ and λ ∆ .β ∆ determines how treatment effect varies continuously as a function of covariates, and λ ∆ determines the magnitudes of the discrete differences in treatment effect between latent classes.The probability distribution that generates G i is a function of X i and parameters β G .β ∆ , λ ∆ , and β G are the parameters of interest as together they describe how treatment effect heterogeneity is related to covariates.The parameters σ G , σ ∆ , and σ C are variances for shrinkage priors that we place on relevant parameters to avoid overfitting.We put weakly informative priors on the shrinkage variances themselves so that the appropriate level of shrinkage is learned from the data.
The framework of the graphical model from Figure 1 allows for flexibility in the selection of functional forms and distributions.We employ Leave One Out Cross Validation (LOO-CV) estimates of posterior predictive model utility to compare options.The full process is best described in the context of examples.We illustrate using simulated data and data from a trial of an AIDS treatment that has previously been analyzed by (Shen and He, to appear) and others studying heterogeneity.

A Simulated Example.
We simulated data from a model of the sort described in Figure 1 in which the continuous heterogeneity is linear and the latent class memberships are generated according to a logistic regression as specified below: X consisted of 5 predictor variables generated from a standard normal distribution.Shif ted_Gamma(µ, shape, rate) denotes a Gamma distribution shifted to have mean µ.We chose shape 0 = shape 1 = 1 and scale 0 = scale 1 = 10 so that the error distributions were highly skewed as in Figure 2 below.

Sample Models and
The Importance of Flexible Error Distributions.We fit two models to the data simulated from the above process.The two models only differed in their error distributions and were correctly specified in all other respects.For the first model, which we will refer to as M N orm , we assumed Gaussian error distributions: The second model, which we will call M F lex , is the same as the first except the error distributions are each a mixture of three normal components.In M F lex , where p z i = 1 and p z i d z i = 0. We put weakly informative normal priors on d 1 and d 2 and a Dirichlet(1) prior on the p i 's.d 3 is then determined by the constraint that the error distribution must have mean 0.
There are a couple of notes on identifiability.First, to avoid aliasing issues, in all models of this form we require that λ G i ∆ increases with the latent class label G i .Second, note that we have included the covariance parameter ρ between the potential outcomes in the graphical but not the algebraic representation of the model.This is because ρ is completely unidentified by the data as we never observe both potential outcomes for the same unit.We do not wish to assert that the potential outcomes are conditionally independent given X and G, so we include ρ in the graph.But only the marginal distributions of Y z serve to identify the parameters of interest, the posterior distributions of which do not depend on ρ.

[[[ADD FIG-URE 3]]]
We see that the M N orm estimates for the parameters of interest are off target, while the M F lex estimates of the same parameters are fairly accurate.This indicates sensitivity to misspecification of the error distribution and reassures us that the strategy of employing a flexible (mixture of normals) error distribution is sufficient to handle the problem.

Model Comparison/Evaluation and the Importance of the Utility Function.
To illustrate model comparison, we also consider a third model, which we will denote by M Cont , that only includes continuous effect modification (i.e. has no latent classes) but also has flexible error distributions.We will use LOO-CV to compare M Cont to M F lex , which we know to be the superior model.We follow the framework for model comparison laid out in (Vehtari and Lampinen, 2002).
A sensible measure of a model M 's value is the expected utility of using M to make predictions about future observations generated by the same process that generated the training data.A Bayesian model produces a posterior predictive distribution for future outcome y new given future covariates x new and the data D that the model M was fit to: p(y|x new , D, M ) = p(y|x new , θ, D, M )p(θ|D, M )dθ, where θ denotes the model parameters.The utility of M for predicting a new outcome is some function u[y new , x new , p(y|x new , D, M )] of the outcome and the posterior predictive distribution that measures how well the posterior predictive distribution predicted the outcome.We can estimate the expected utility of a model on populations similar to the training data as the average 1 where D −i denotes the data with the i th observation removed and N is the number of observations in D. This is the LOO-CV estimate of the expected utility of a model M. To estimate the expected utility on a population whose covariates differ from the training data in known ways, a weighted average can be used.Because it is computationally prohibitive to fit the model once for each data point, we approximate the LOO-CV estimate using an importance sampling scheme proposed by (Gelfand, 1996;Vehtari and Lampinen, 2002).
To compare two models M 1 and M 2 , interest lies in their expected difference in utility, which can be estimated as: Generally, the choice of utility function depends on the application.For many applications, the posterior predictive mean is taken as the forecast and an appropriate utility is a monotonic function of the distance of the posterior predictive mean from the actual outcome.For example, the squared error utility function would be: Such utilities are problematic for the purpose of distinguishing models that contain discrete latent class heterogeneity from those that contain only continuous heterogeneity because they ignore the shape of the posterior predictive distribution.If there really is heterogeneity, the posterior predictive distribution for a latent class model will be multimodal and its mean will lie somewhere between the modes.The posterior predictive mean for a continuous effect modification model will usually be similar, so utilities based on the accuracy of the posterior predictive mean will have low power to distinguish models with these types of heterogeneity.
A commonly used utility function that does not suffer from this problem is the posterior predictive density (ppd): u ppd [y new , x new , p(y|x new , D, M )] = p(y new |x new , D, M ).This utility rewards models that place lots of posterior predictive probability mass near future outcome values.A model with a multimodal posterior predictive distribution would be rewarded for outcomes that lie near any mode and penalized for outcomes that lie in the low density regions between modes.This utility has several nice theoretical properties as well.The model with the highest mean posterior predictive density minimizes Kullback Leibler distance to the true model.Posterior predictive density is also a proper scoring rule (Dawid,?).A drawback of this utility is that it is sensitive to our choice of error distribution, and we do not directly care about modeling the error distribution for our application.If we use very flexible error distributions for all candidate models, though, this should not be a serious problem.In our simulated example, the LOO-CV estimated expected ppd of M F lex was .04 compared to .03 for M Cont .A paired t-test comparing the samples of LOO-CV pdds from the two models rejected the null hypothesis that E[u M F lex − u M Cont ] = 0 with p-value numerically 0.
All code for simulations discussed in this section is available at [asdfdkl;fjaldkfj].

2.3.
Re-analysis of Data from the ACTG 320 Clinical Trial.The ACTG 320 trial compared two AIDS treatments-a combination of indinavir, zidovudine, and lamivudine versus just zidovudine and lamivudine.Following (Shen and He, to appear) who themselves follow (Hammer et al., 1997) and (Zhao et al., 2013), we take change in CD4 count at the 24th week of treatment as the response variable, exclude patients with missing outcome values or extreme CD4 counts, and ignore any bias that we may induce by these exclusions.We are left with a dataset of 800 patients.A summary of the data is included in the Appendix.

[[[ADD]]]
Before fitting any models, we test the null hypothesis of a constant treatment effect using Rosenabaum's covariance adjustment test (Rosenbaum, 2002).The test produces a p value that is approximately 0, so we are quite certain that there is heterogeneity.The question remains whether it is related to recorded covariates and whether we can effectively model it.
We fit multiple models and compare them using LOO-CV with posterior predictive density as the utility function.In some models, we use just the 3 covariates that Shen and He considered (baseline CD4, baseline RNA, and age), while in others we take advantage of regularization to include the 9 others that were available.M 1 is Shen and He's model with two latent classes, a common normal error distribution for all patients, no continuous effect modification, and just 3 covariates.M 2 is a constant effect model with separate flexible error distributions for each treatment group.Note that 'constant effect' is a misnomer, since the distinct error distributions for the two treatment arms allow for heterogeneity, just not associated with the covariates.M 3 is the same as M 1 but with separate flexible error distributions for each treatment group.M 4 is the same as M 3 but includes all 12 covariates.M 5 is a continuous effect modification model with separate flexible error distributions for each treatment group and only 3 covariates.M 6 is the same as M 5 but includes all 12 covariates.M 7 is a 2 latent class mixture model with continuous effect modification and flexible error distributions and only 3 covariates.M 8 is the same as M 7 but includes all 12 available covariates.M 9 is the same as M 8 except that it has 3 latent classes instead of 2. All continuous effect modification was specified as linear, and all latent class membership models were specified as logistic or, in the case of M 9 , multinomial logistic.A summary of relevant parameter estimates from these models is in the Appendix along with full specifications of each model.

[[[ADD]]]
Figure 4 depicts the LOO-CV estimated expected posterior predictive densities of each model.The first thing that jumps out in this plot is that M 1 , which is Shen and He's model with a normal error distribution, performs far worse than the other models which all use flexible error distributions.This is not necessarily meaningful, however.Our parameters of interest do not govern the error distribution, so the utility could be rewarding these models purely for better modeling an aspect of the data that is not important to us.But comparing the parameter estimates of M 3 (which is Shen and He's model with flexible error distributions) to M 1 , we see that there are substantive differences.The estimated difference in treatment effects between latent classes is significantly smaller in M 3 than in M 1 .Observing that the residuals are highly skewed (Figure 5) and recalling the lessons learned from the simulation in Section 2.2.1, we suspect that the misspecified error distribution of M 1 biased the estimates of parameters of interest.However, it is still true that much of the difference in expected utility could be due to error distribution alone.
Next we turn our attention to the models with flexible error distributions.The most complex model, M 9 , has the highest expected utility.When comparing models, we note that the estimated utilities are the sample means of the LOO-CV posterior predictive densities of the 800 observations.Since the samples of LOO-CV ppds produced by each model are based on the same observations, they are dependent and can be compared using classical methods for dependent samples such as paired t-tests or Wilcoxon signed rank tests.Paired t-tests indicate that the sample mean LOO-CV utility for M 9 is statistically significantly greater than the sample mean LOO-CV utilities of every other model.We can obtain a conservative p-value for the null hypothesis that M 9 has the highest true expected utility of all models considered by taking the p-value of the comparison between M 9 and the next best model (M 8 ) and adjusting for multiple comparisons using Holm's method.The paired t-test comparing M 9 to M 8 had p-value .006,and adjusting for the other 7 similar comparisons we might have made (i.e.testing whether models 2 through 8 were the best) yields p ≈ .04.That is, the probability of M 9 having such a superior estimated utility due to sampling variation alone if any of the other models had true expected utilities as good as M 9 's is less than .04('less than' because our p-value is conservative).This indicates strong but not necessarily overwhelming support for M 9 , so we would not completely dismiss other models with similar utilities such as M 8 , M 7 , and M 4 .We definitely prefer M 9 but would take its implications with a grain of salt if they contradicted one of the other models with fairly similar utilities.
We now take a closer look at what the models said about heterogeneity.Where comparisons between models could be made, the models were generally in agreement.First, every model found that heterogeneity was associated with the covariates (apart from the constant model M 2, obviously).Of the three covariates that were included in every model (except M 2 ), baseline CD4 count and RNA levels, which we will denote cd4 0 and rna 0 , were unanimously positively associated with treatment effect after adjusting for other covariates.The models that contained all 12 covariates also agreed that weight was positively associated and prior zidovudine exposure negatively associated with treatment effect adjusting for other covariates.(We will omit 'adjusting for other covariates' for the remainder of this discussion, but it should be understood that all associations might depend on which other covariates were included in the model.) Every model that contained both continuous and discrete heterogeneity components (M 9 , M 8 , and M 7 ) attributed most covariate associated heterogeneity to discrete differences between latent classes.Every model that included latent classes agreed that there was a substantial difference of about 60-80 CD4 count between the highest treatment effect class and the lowest.The three class model, M 9 , also included a middle class with estimated treatment effect approximately 10 higher than in the lowest class.The treatment affect in the low class for an average patient was about 45-55 in all models.All latent class models agreed that cd4 0 and rna 0 were positively associated with membership in the highest class.The models with 12 covariates also agreed that weight was positively associated and prior zidovudine negatively associated with membership in the highest class.In the two class models (M 3 ,M 4 ,M 7 , and M 8 ), strength of association with class membership is easily discerned from the posterior distributions of the relevant logistic regression coefficients from the β G parameter.In the preferred three class model, in which class is determined by a multinomial logistic regression, the relationship between beta G and the description of association is more subtle.Figure 6 compares the M 9 posterior distributions of probability of membership in the highest effect class for three hypothetical patients-one with the maximum observed value of cd4 0 , one with the mean observed value of cd4 0 , and one with the minimum observed value of cd4 0 .All three hypothetical patients were assigned median values for all other covariates.Comparing the high and medium patients, we see that the high cd4 0 value makes low probabilities of membership in the highest treatment effect class less likely but does not alter the mode.For very low values of cd4 0 , membership in high treatment effect class is virtually impossible.So M 9 appears to pick up on a nonlinear aspect to the association between cd4 0 and highest treatment effect class membership.Low values of cd4 0 are also strongly associated with membership in the middle class.
The models that contained both continuous and discrete heterogeneity components (M 9 , M 8 , and M 7 ) also all detected a possible moderate linear association with rna 0 and no other significant linear associations.So our preferred model M 9 and all the other credible models together imply mostly discrete heterogeneity associated with cd4 0 , rna 0 , weight, and zidovudine exposure with possible modest continuous effect modification by rna 0 .The strong performance of M 9 might be attributed to the more flexible relationships it allows between covariates and high effect class membership.A more thorough analysis would explore models with nonlinear logistic regressions, more than 3 latent classes, interactions among covariates, and distinct error distributions for each latent class instead of just for each treatment group.
As a last step, we performed some basic posterior predictive checks (Gelman, 1995) to affirm that M 9 not only outperforms the other candidates but also fits the data reasonably well.First, Figure 7 compares a histogram of the outcomes from the real trial to a histogram of fake outcomes that were simulated from the posterior predictive mean values of the parameters from M 9 .The distributions are remarkably similar.One small difference is that the real data seems more sharply peaked near zero.We simulate 1000 fake new data sets from the posterior predictive distribution of M 9 , measure summary statistics of each fake data set (including kurtosis, to investigate that sharp peak), and check whether the corresponding summary statistics of the true data fall within the range of the simulations.The exercise is summarized in figure 8 below, in which the red dots indicate the summary statistic values in the real data.The model appears to fit well by these criteria, though the observed kurtosis is a tad unusual according to the posterior predictive distribution.

Figure 3
Figure 3 compares the estimates from M N orm to those from M F lex .[[[ADD FIG-URE 3]]] We see that the M N orm estimates for the parameters of interest are off target, while the M F lex estimates of the same parameters are fairly accurate.This indicates sensitivity to misspecification of the error distribution and reassures us that the strategy of employing a flexible (mixture of normals) error distribution is sufficient to handle the problem.