Multiple imputation of covariates by fully conditional specification: Accommodating the substantive model

Missing covariate data commonly occur in epidemiological and clinical research, and are often dealt with using multiple imputation. Imputation of partially observed covariates is complicated if the substantive model is non-linear (e.g. Cox proportional hazards model), or contains non-linear (e.g. squared) or interaction terms, and standard software implementations of multiple imputation may impute covariates from models that are incompatible with such substantive models. We show how imputation by fully conditional specification, a popular approach for performing multiple imputation, can be modified so that covariates are imputed from models which are compatible with the substantive model. We investigate through simulation the performance of this proposal, and compare it with existing approaches. Simulation results suggest our proposal gives consistent estimates for a range of common substantive models, including models which contain non-linear covariate effects or interactions, provided data are missing at random and the assumed imputation models are correctly specified and mutually compatible. Stata software implementing the approach is freely available.


Introduction
Missing data is a pervasive problem in both experimental and observational medical research, causing a loss of information and potentially biasing inferences.In this article we focus on settings in which interest lies in fitting a substantive model relating an outcome to a number of covariates, one or more of which contain missing values.The method of multiple imputation (MI) has become an extremely popular approach for accommodating missing data in statistical analyses, both generally [1], and in the specific context of partially observed covariates [2].MI involves 'filling in' each missing value with draws from an appropriate distribution, leading to a number M of completed datasets.The substantive model can then be fitted to each of the M completed datasets, and the results combined across the M datasets using Rubin's rules [3], which account for the uncertainty due to the fact data have been imputed.MI is most often applied under the missing at random (MAR) assumption, which stipulates that the probability that data are missing are independent of the missing values, conditional on the observed data, although MI can also be used when data are missing not at random [3].
Parametric MI as originally proposed is based on a joint imputation model for the partially observed variables (conditional on any fully observed variables), which we refer to as 'joint model MI'.A popular alternative to joint model MI is the fully conditional specification (FCS) approach ( [4], [5]).FCS MI involves specifying a series of univariate models for the conditional distribution of each partially observed variable given the other variables.This permits a great deal of flexibility, since an appropriate regression model can be selected for each variable (e.g.linear regression for continuous variables, logistic regression for binary variables).Consequently, FCS MI is particularly appealing in settings in which a number of variables have missing data, some of which are continuous and some of which are discrete.
One of the strengths of MI is that it divides the process of dealing with missingness (the imputation stage) from the analysis of the completed data (the analysis stage).As has been previously discussed in detail, this division presents both opportunities and threats ( [6,7,8,9]).For example, we may be able to include so called auxiliary variables in the imputation model which are not used in the analysis stage.This offers the potential for increased efficiency and may also improve the plausibility of the MAR assumption holding, by conditioning on auxiliary variables which are predictive of missingness.Sometimes the imputer and analyst will be different people.If the imputer has additional knowledge which enables them to impose (correct) additional assumptions in the imputation model, the analyst will gain efficiency.
The division may however sometimes lead to problems.In the context of imputing partially observed covariates, imputations might be generated from a model which is incompatible with the substantive model, which may lead to (asymptotically) biased estimates of parameters in the latter.Two conditional models are said to be incompatible if there exists no joint model for which the conditionals (for the relevant variables) equal these conditional models.This is a particular issue when the substantive model contains non-linear covariate effects or interactions, with which default imputation model choices may be incompatible.For example, suppose the substantive model is the linear regression of Y on a continuous covariate X and X 2 , and we wish to impute missing values in X.The default choice for the imputation model for X|Y in software for MI is a normal linear model, with conditional mean equal to a linear function of Y , which is incompatible with the quadratic substantive model.Following imputation of X, X 2 is then passively imputed by squaring the imputed X values.In this case, estimates of the parameters of the substantive model from multiply imputed datasets will be biased (unless the quadratic coefficient is in truth zero), because within the subset of data where X has been imputed the association between X and Y is linear as a consequence assuming linearity in the imputation model [10].
Incompatibility between the imputation and substantive models can be avoided by specifying a joint model for outcome and covariates for which the conditional distribution of outcome given covariates matches the substantive model and then using the imputation model implied by this joint model.However, specification of a joint model is challenging when there are a number of partially observed covariates, particularly when some are continuous and some are discrete.In this setting the FCS method is an attractive option.However, the default univariate imputation models used in FCS may be incompatible with the substantive model.In this paper we therefore propose a modification of the popular FCS approach to MI which ensures that each of the univariate imputation models is compatible with the assumed substantive model.
We begin in Section 2 by introducing a motivating example from the Alzheimer's Disease Neuroimaging Initiative (ADNI).In Section 3 we formally define compatibility between an imputation model for partially observed covariates and a substantive model, explain when incompatibility implies imputation model mis-specification, and give examples of when this occurs.We then outline how imputation models can be specified which are compatible with a given substantive model within the joint modelling approach to MI, which motivates our modification of the FCS MI approach.In Section 4 we briefly review the standard FCS framework for MI in the setting of partially observed covariates.In Section 5 we describe our modification of the FCS MI approach which ensures that each univariate imputation model is compatible with the substantive model.We give details for how this can be done when the model of interest is i) normal linear regression, ii) a model for a discrete outcome (e.g.logistic and Poisson regression), or iii) a proportional hazards model.We report the results of a simulation study to investigate the performance of our proposed approach in Section 7. In Section 8 we apply our proposed approach to the motivating example.In Section 9 we discuss how our proposed approach can be used when, as is often the case, interest lies in fitting a number of different substantive models to the data.We conclude in Section 10 with a discussion.

Motivating example
The Alzheimer's Disease Neuroimaging Initiative (ADNI) was launched in 2003 by the National Institute on Aging (NIA), the National Institute of Biomedical Imaging and Bioengineering (NIBIB), the Food and Drug Administration (FDA), private pharmaceutical companies, and nonprofit organizations, as a 5-year public-private partnership.
The aims of ADNI included assessing the ability of imaging and other biomarkers to measure the progression of mild cognitive impairment and early Alzheimer's Disease (AD).The study aimed to recruit approximately 200 cognitively normal older individuals (controls), 400 with mild cognitive impairment (MCI), and 200 with early AD.
Participants underwent clinical and cognitive assessment and MRI brain scans at baseline and at specified intervals (every 6 or 12 months, depending on subject group) up to 3 years.Further details regarding ADNI are given in the Acknowledgements.
Recently Jack Jr et al used data from ADNI to investigate baseline predictors of time to conversion to AD in those subjects with MCI at baseline ( [11]).In particular, using Cox proportional hazards models they found evidence of a non-linear association between amyloid β 1-42 peptides (Aβ 1−42 ) measured from cerebrospinal fluid (CSF) at baseline and hazard of conversion.They also found evidence that lower baseline hippocampal volume was predictive of increased hazard, after adjusting for total intracranial volume (a measure of head size).Participants were invited to have CSF measured at baseline by lumbar puncture, but were not required to do so to participate in the study.Consequently, only around 50% of subjects had CSF Aβ 1−42 measured.The analysis of Jack Jr was restricted to the subset of n = 218 MCI subjects for whom CSF Aβ 1−42 was measured.It may be reasonable to assume that the propensity to agree to lumbar puncture (and thus have CSF Aβ 1−42 measured) is independent of time to conversion to AD, conditional on CSF Aβ 1−42 , and so such a complete case analysis might reasonably be assumed to be unbiased.However, it is inefficient, since it only uses data on 50% of MCI subjects.Jack Jr et al also found evidence that presence of the APOE4 gene, previously shown to be associated with development of AD in a number of studies, was associated with increased hazard for conversion to AD.MCI is a heterogeneous classification, with only a certain proportion of subjects eventually going on to develop AD.For each subject their family history was collected at baseline, in particular in relation to whether their mother or father suffered from dementia or AD.Given that there is a genetic component to the disease, we were interested to investigate whether the presence of family history of AD was associated with increased hazard of conversion to AD, by including covariates indicating whether the subject's mother and father had had AD (Table 1).Unfortunately, although family history of dementia was well recorded, family history of AD specifically suffered from missingness (see Table 1).
We aimed to estimate the parameters of a Cox proportional hazards model for hazard of conversion to AD using the available data from the n = 382 ADNI subjects who had MCI at baseline and who had at least one follow-up visit.Of these subjects, 167 were observed to convert to AD during follow-up.Our substantive model contained as covariates the variables listed in Table 1, plus the square of Aβ 1−42 to allow for the non-linear association previously identified by [11].In addition to CSF Aβ 1−42 , we include CSF Tau and P-tau as covariates, which are also thought to reflect pathology, and thus might be associated with hazard of conversion to AD. Tau and p-tau were log transformed to reduce skewness in their distribution.The FCS approach to MI is attractive here, since seven covariates are partially observed, with some continuous and some binary.However, we should be careful to 3 Multiple imputation of partially observed covariates

Setup
We consider the setting in which interest lies in fitting a model to a fully observed outcome Y with p partially observed covariates X = (X 1 , .., X p ) and q fully observed covariates Z = (Z 1 , .., Z q ).Let X obs and X mis denote the observed and missing components of X for a given subject, and let R be the vector of observation indicators whose elements are zero or one depending on whether the corresponding element on X is missing or observed.
We assume throughout that the data are missing at random (MAR) [12].Here MAR means that P (R|Y, X, Z) = P (R|Y, X obs , Z).We assume that (Y i , X i , Z i , R i ), i = 1, .., n are independent and identically distributed.Lastly, we let f (Y |X, Z, ψ) denote the 'substantive model', which is indexed by parameter ψ (ψ ∈ Ψ).We assume throughout that this substantive model is correctly specified.That is, there exists where f 0 (Y |X, Z) denotes the true conditional distribution of Y given X and Z.

Multiple imputation of partially observed covariates
In Bayesian parametric MI, to multiply impute missing values in X we specify a parametric model f (X|Z, Y, ω), ω ∈ Ω for the conditional distribution f (X|Y, Z).To create the mth imputed dataset we first draw ω (m) from its posterior distribution given the observed data {(Y, X obs , Z); i = 1, .., n} and a (usually noninformative) prior f (ω).For each subject the missing values (if any) X mis are imputed by taking a draw from the density f (X mis |X obs , Y, Z, ω (m) ) implied by f (X|Y, Z, ω (m) ).
Having created M imputed datasets, the substantive model parameter ψ is then estimated separately using each imputed dataset, resulting in estimates ψ1 , .., ψM and corresponding variances Var( ψ1 ), .., Var( ψM ).In this article we assume that the substantive model is fitted using maximum likelihood.Rubin's rules are then invoked to provide a final inference for ψ, with the estimator of ψ given by ψMI = and an estimate of the variance of ψMI is given by Suppose that the posited imputation model is correctly specified, so that there exists a value of ω ∈ Ω such that Then ψMI is a consistent estimator of ψ, and as the number of imputations M → ∞, confidence intervals based on Var( ψ) achieve coverage at or above their nominal level [6].

Compatibility and imputation model mis-specification
When an imputation model f (X|Z, Y, ω) is directly specified, it may be mis-specified if it is not compatible with the substantive model (assuming this is correctly specified).For example, if the correctly specified substantive model includes an interaction between a partially observed covariates and a fully observed covariate in their effect on the outcome Y , imputation models which do not allow for this interaction will generally (unless the interaction term is in truth zero) be mis-specified.Such considerations led to recommendations that imputation models be used which do not impose restrictions which will conflict with subsequent analyses of the imputed datasets [6,7].
Following Liu et al [13], we now define the notion of compatibility between a set of conditional models.Let A = (A 1 , .., A p ) be a vector of random variables, and let A −j = (A 1 , .., A j−1 , A j+1 , .., A p ). Then a set of conditional models {f j (A j |A −j , θ j ); θ j ∈ Θ j , j = 1, .., p} is said to be compatible if there exists a joint model g(A|θ), θ ∈ Θ and a collection of surjective maps {t j : Θ → Θ j ; j = 1, .., p} such that for each j, θ j ∈ Θ j , and θ ∈ t −1 j (θ j ) = {θ : Otherwise the set of models {f j ; j = 1, .., p} is said to be incompatible.
A weaker property which we shall also use is that of semi-compatibility for a set of models.A set of models is semi-compatible if they can be made compatible by setting one or more parameters to zero.More formally (again following Liu et al [13]), a set of conditional models {h j (A j |A −j , θ j , κ j ); θ j ∈ Θ j , κ j ∈ K j , j = 1, .., p} is said to be semi-compatible if there exists a set of compatible conditional models {f j (A j |A −j , θ j ); θ j ∈ Θ j , j = 1, .., p} such that f j (A j |A −j , θ j ) = h j (A j |A −j , θ j , κ j = 0), for j = 1, .., p.A set of compatible conditional models is always semi-compatible.
Incompatibility between the imputation and substantive models does not necessarily imply mis-specification of the former.For example, suppose the substantive model is Y |X ∼ (ψ 0 + ψ 1 X, σ 2 ψ ) and the imputation model is Then again the models are semi-compatible but not compatible, and unless ψ 2 = 0 in truth, the imputation model will be mis-specified, since there exist no joint model with conditionals corresponding to the substantive and imputation models.Consequently the MI estimator ψMI will be inconsistent, as demonstrated through simulation by von Hippel [15] and Seaman et al [10].
Incompatibility may also arise when default imputation models are used for covariates in non-linear substantive models.For example, suppose T (rather than Y ) is a time to event outcome which is not subject to censoring, and that the substantive model is the parametric exponential model, with hazard function h(t) = h 0 exp(ψX), with X a continuous partially observed covariate.In this case H 0 (T ) = T 0 h 0 (t)dt = h 0 T .Then suppose, following the recent recommendations of White and Royston [16], we adopt a normal linear imputation model for X|T ∼ N (ω 0 +ω 1 T, σ 2 ω ) with T ∝ H 0 (T ) as covariate.Then the two models are incompatible, and consequently the MI estimator ψMI is inconsistent (although simulations by White and Royston [16] show the bias is often small).
In conclusion, except in cases where the imputation and substantive models can be made compatible by restricting the parameter space Ω of the imputation model, incompatibility between the two implies the imputation model is mis-specified (assuming correct specification of the substantive model).Consequently, when choosing the covariate imputation model f (X|Z, Y, ω) we should (at least) ensure that it is either compatible with the substantive model, or a restriction of it is compatible with the substantive model.

Joint model imputation
The natural route to ensuring compatibility between the imputation and substantive models is to explicitly specify a joint model g(Y, X|Z, θ) which has the substantive model f (Y |X, Z, ψ) as its corresponding conditional, and to derive the implied imputation model.Given the (correctly) specified substantive model f (Y |X, Z, ψ), such joint models are specified by defining a model f (X|Z, δ).The imputation model is then given by We emphasize that using a compatible imputation model does not guarantee it is correctly specified -this is only true if, in addition to the substantive model being correctly specified, f (X|Z, δ) is correctly specified.
In cases where p = 1 and X is univariate, specification of a model f (X|Z, δ) is relatively straightforward.When X is multivariate, and particularly when it contains a mixture of continuous and discrete variables, specification of a joint model f (X|Z, δ) becomes more challenging.In this setting Ibrahim et al [17] proposed specification by factorising the joint distribution of X|Z as a product of univariate densities of the form This breaks the problem of joint specification into the easier task of specification of a series of univariate models.
This means that appropriate univariate regression models can be specified depending on the type (i.e.continuous, discrete, ordered discrete) of each variable.However, as the dimension of X increases the number of possible orderings of its components increases rapidly, and it is not obvious which ordering should be chosen.As far as we aware this approach to MI has not been adopted by applied researchers.

Fully conditional specification imputation
In the more general setting of MI in multivariate data, the fully conditional specification (FCS) approach to MI, which we describe in detail in the following section, similarly splits the task of specification of a joint model into a series of univariate model specifications.In FCS MI models are specified for each partially observed variable conditional on all other variables.In contrast to the approach proposed by Ibrahim et al [17], no choice of ordering for model specification is required.FCS MI has now become an extremely popular approach to MI generally [5].
Application of FCS for imputation of partially observed covariates involves specification of models of the form f (X j |X −j , Z, Y ), where X −j denotes the components of X with X j removed.As we have described, for certain substantive models, such as those involving non-linear covariate effects or interactions, default choices of these imputation models within FCS will be incompatible (and further, not semi-compatible) with the substantive model, and will therefore be mis-specified.Motivated by this, in Section 5 we propose a modification of FCS MI which ensures that each of the covariate models f (X j |X −j , Z, Y ) is compatible with the substantive model.

Review of fully conditional specification multiple imputation
In this section we review the fully conditional specification (FCS) approach to MI [5], within the setup of partially observed covariates previously defined in Section 3.

The fully conditional specification algorithm
For each partially observed covariate X j , we posit an imputation model f (X j |X −j , Z, Y, θ j ), with parameter θ j , where X −j = (X 1 , .., X j−1 , X j+1 , .., X p ).This is typically a generalised linear model chosen according to the type of X j (e.g.continuous, binary, count).Furthermore, a non-informative prior distribution p(θ j ) for θ j is specified.) denote the vector of observed and imputed values at iteration t.The tth iteration of the algorithm consists of drawing from the following distributions (up to constants of proportionality): x ) 2 ) . . .
p−1 , z, y, θ p ) Thus, for the partially observed covariate X j the algorithm first draws from the posterior distribution of the corresponding imputation model parameters determined by the prior and likelihood corresponding to the imputation model fitted to data from subjects for whom X j was observed, conditional on all the other variables (observed plus most recently imputed values).Note that in this respect the FCS algorithm differs from a standard Gibbs sampler, which at the same step would additionally condition on x mis j from the previous iteration.Missing values in X j are then imputed from the imputation model using the parameter value drawn in the preceding step.After a sufficient number of iterations it is assumed that the algorithm has converged to a stationary distribution, and the final draws of the missing data form a single imputed dataset.The process is then repeated to create as many imputed datasets as desired.In software implementations of FCS MI the variables are typically updated in the ordering for which the missingness pattern is closest to monotone.Finally, the substantive model is fitted to each imputed dataset, and the results combined using Rubin's Rules as described previously.

Statistical properties
Despite the fact FCS MI has been applied widely in a number of fields [5], until recently few results were available regarding its validity.This is due to the fact that one can specify imputation models (which in our setting are f (X j |X −j , Z, Y, θ j )) that are not mutually compatible ( [18]).In this case it is not clear to what distribution, if any, the algorithm will converge.
Liu et al have recently given sufficient conditions under which FCS MI is asymptotically equivalent to MI from a Bayesian joint model [13].Principal among these is that the set of conditional imputation models are mutually compatible.However, compatibility is not sufficient for equivalence between FCS and MI from a Bayesian joint model.One situation in which equivalence does not hold despite the imputation models being compatible is when information regarding parameters of a conditional model is contained, but not utilised, in the marginal distribution of the variables being conditioned on.An example of this is when a binary variable is imputed using logistic regression conditional on a continuous variable, with the latter imputed using a normal linear regression model.Although these models are compatible with each other, FCS imputation fails to utilise the information in the conditional distribution of the continuous variable given the binary variable regarding the logistic regression parameters, and consequently FCS MI is not equivalent to MI from a Bayesian model.
Liu et al further show that provided each of the conditional models is correctly specified, the estimator ψMI is consistent [13].Thus if a set of conditional models is used which are semi-compatible, and each is correctly specified, consistent estimates are obtained.Note that in this case since FCS is not necessarily equivalent to imputation from a Bayesian joint model, there is no guarantee that Rubin's rule for the variance will provide valid inferences.This result can also be used to conclude that in the linear-logistic example described in the previous paragraph, in which the models are compatible (and therefore also semi-compatible) ψMI is consistent provided both models are correctly specified.If the conditional models used are not even semi-compatible, in general we expect inconsistent estimates.
5 Fully conditional specification imputation accommodating the substantive model In this section we describe how the standard FCS algorithm, described in Section 4, can be modified to ensure that each of the univariate imputation models used is compatible with the substantive model.We term the algorithm substantive model compatible FCS (SMC-FCS).

The algorithm
First we specify a non-informative prior for the parameters of the substantive model, denoted f (ψ).Then for each j = 1, .., p we specify a model f (X j |X −j , Z, φ j ) (j = 1, .., p) and non-informative prior f (φ j ).At iteration t, for j = 1, .., p we first draw ψ (t,j) from Then a draw φ (t) Note that here, as in a standard Gibbs sampler, the draw is made from the posterior corresponding to the fit of the model f (X j |X −j , Z, φ j ) using data (imputed and observed) from all subjects, rather than to only those subjects for whom X j is observed.This is necessary since, if missingness in X j is dependent on Y , drawing from the posterior corresponding to the model fitted to only those with X j observed would introduce bias.
For each subject with X j missing we then impute their missing value from the density proportional to By construction, following equation (1), this density will be compatible with the substantive model f (Y |X, Z, ψ).
However, in general the density will not belong to a standard parametric family, complicating direct simulation of values.In Section 6 we therefore show how rejection sampling can be used to draw from the density, giving implementations for a number of important types of substantive model.

Statistical properties
SMC-FCS is an example of a (possibly incompatible) Gibbs sampler.However, as with standard FCS, determining the statistical properties of the algorithm in generality is challenging.In some special cases the algorithm corresponds to a Gibbs sampler for a well defined Bayesian model.This will be true when there exists a joint model g(X|Z, γ) and prior f (γ) for which the conditionals required for a Gibbs sampler correspond to those in equations ( 4), (5) and (6).Since in this case SMC-FCS is equivalent to imputation from a Bayesian joint model, Rubin's rules can then be applied for inference.
As for standard FCS MI, compatibility between the models f (X j |X −j , Z, φ j ) is not sufficient for equivalence with Bayesian MI.If one partially observed covariate is modelled conditional on a second using logistic regression, with the second modelled with normal linear regression given the first, SMC-FCS is not equivalent to Bayesian joint model MI, despite these models being compatible, for the reason given in Section 4.2.
If the covariate models f (X j |X −j , Z, φ j ) are semi-compatible and correctly specified, we conjecture that, the MI estimator of the substantive model parameters ψMI will be consistent.We investigate this in Section 7 using simulations.In cases where SMC-FCS is consistent but not equivalent to imputation from a Bayesian joint model there is no guarantee that confidence intervals based on Rubin's variance estimator will give at least nominal coverage.Lastly, if the covariate models are not semi-compatible, in general we do not expect the estimator ψMI to be consistent.
In software implementations of the standard FCS algorithm 10 iterations are typically used to 'burn-in', based on empirical experience suggesting this is often sufficient for convergence of the sampler.Since when imputing missing values in X j our proposed modification of FCS involves fitting models using the most recently imputed values of X j , we expect SMC-FCS to require a larger number of iterations in order to converge to the required stationary distribution, assuming it exists.

Sampling from the imputation model
In this section we give details of how the method of rejection sampling can be used to sample from the density given in equation ( 6) for some of the most common types of substantive model.Rejection sampling involves creating draws from a proposal density (from which it is easy to draw), until a draw is made satisfying a particular condition.
Suppressing the iteration index t, we choose f (X j |X −j , Z, φ j ) as our proposal density, on the assumption that it is easy to sample from this density.To use rejection sampling, the ratio of the target density to the proposal density (up to a constant of proportionality) must be bounded above in X j [19].Here this ratio is proportional to: Let c(Y, X −j , Z, ψ) denote an upper bound (in X j ) for f (Y |X j , X −j , Z, ψ).To generate a draw from the density proportional to equation ( 6), we sample pairs of values X * j from the density given by f (X j |X −j , Z, φ j ) and U from the uniform distribution on (0,1) until: When this inequality is satisfied, the value X * j is a draw from the density proportional to equation ( 6).We must therefore bound f (Y |X j , X −j , Z, ψ) in X j .The bound will depend on the specification of the substantive model.In the following we derive bounds for the cases of i) a normal regression model, ii) a model for a discrete outcome Y , and iii) a proportional hazards survival model.

Normal regression
Suppose that the substantive model specifies that Y is normal, with conditional mean E(Y |X) = g(X j , X −j , Z, β) for some function g(), and residual variance σ 2 ǫ , so that ψ = (β, σ 2 ǫ ).Then: To generate a draw for the missing value X j , we draw a value X * j from f (X j |X −j , Z, φ j ), and U from the uniform distribution on (0, 1).The draw X * j is accepted if: If the draw is not accepted, new draws of X * j and U are made until they satisfy the condition in equation (9).

Discrete outcomes
Now consider a discrete outcome Y .This includes the case of a binary outcome Y , which is commonly modelled using logistic regression.When Y is discrete, f (Y |X j , X −j , Z, ψ) is a probability, and hence is less than or equal to one.The rejection sampling algorithm then consists of drawing X * j from f (X j |X −j , Z, φ j ) and U ∼ U (0, 1), and accepting X * j when:

Proportional hazards models
Lastly, suppose that interest lies in the time T to an event of interest, but this time may be censored.Let C denote the censoring time.We observe W = min(T, C), and D = 1(T < C), which denotes whether the subject's event time has been observed or was censored.We assume that censoring is noninformative, in the sense that T ⊥ ⊥C|X, Z.
Furthermore, we assume C⊥ ⊥X|Z.Together these assumptions allow us to avoid modelling the censoring process ( [20]).
We assume that the substantive model is the proportional hazards model: where h(t|X) denotes the hazard at time t, h 0 (t) denotes the baseline hazard at time t and g(X j , X −j , Z, β) denotes a function of X and Z indexed by parameter β.In parametric proportional hazards models the baseline hazard function is parametrized by a finite set of parameters λ, so that ψ = (β, λ).In Cox's proportional hazards model the baseline hazard is allowed to be arbitrary, so that ψ = (β, h 0 (.)) with h 0 (.) an infinite dimensional parameter.
We first consider how to sample X j for a subject for whom D = 0, i.e. their time to event is censored.Then since by assumption T ⊥ ⊥C|X, Z and C⊥ ⊥X|Z, We draw X * j from f (X j |X −j , Z, φ j ) and U ∼ U (0, 1), and accept X * j when: where H 0 (t) = t 0 h 0 (s)ds denotes the cumulative baseline hazard function.For a subject who is not censored (D = 1), we have: Since exp() is monotonically increasing, this expression takes its maximum when g(X j , X −j , β) − H 0 (t)e g(Xj ,X−j ,β) takes its maximum.Differentiating this with respect to g() and setting the resulting expression to zero shows that this occurs when H 0 (t)e g(Xj ,X−j ,β) = 1.Therefore: We can thus draw X * j from f (X j |X −j , Z, φ j ) and U ∼ U (0, 1), and accept X * j when:

Simulation study
In this section we describe the results of simulation studies to investigate the performance SMC-FCS in situations in which the substantive model is incompatible with standard choices for covariate imputation models.

Linear regression with quadratic covariate effects
We first simulated from a linear regression substantive model with a single covariate X with linear and quadratic effects, for which standard imputation model choices for the covariate X are incompatible.

Simulation setup
The outcome Y was simulated according to: ).These coefficients were chosen to give a moderately strong U-shaped association between Y and X.The variance σ 2 ǫ was chosen such that the coefficient of determination R 2 was equal to 0.5.
The covariate X was simulated from a normal, a log-normal, or a normal mixture distribution.For all three distributions X had mean 2 and variance 1.For the log-normal distribution, X was generated by exponentiating a draw from N (log( √ 3.2), log (5/4)).For the normal mixture distribution, X was drawn from N (1.125, 0.234) with probability 0.5 and from N (2.875, 0.234) with probability 0.5.
For each distribution of X, values were made missing either according to the MCAR mechanism P (R = 1|X, Y ) = 0.7 or the MAR mechanism and α 0 was chosen to make the marginal probability of observing X equal to 0.7.In all simulations datasets for n = 1, 000 subjects were generated, and 1,000 simulations were performed for each scenario.

Estimation methods
For each simulated dataset we first imputed the missing values of X using a linear regression model with the ice command in Stata.We used the default imputation model, with the expectation of X modelled as a linear function of Y .Note that since here there is only one partially observed variable, no iteration is required within FCS.Missing values of X 2 were then passively imputed as the square of these imputed values of X ('linear imputation').Second, we imputed the missing X values using the 'transform then impute' or 'just another variable' (JAV) approach proposed by [15], that is, by treating X 2 as another variable to be imputed in the ice command in Stata.Third, we imputed X using the mice.impute.quadraticfunction in the R MICE package ('polynomial combination').This implements a method recently proposed by Van Buuren (p140 [21]), which imputes the linear combination of X and X 2 which enters in the linear predictor of the substantive model, followed by solving a quadratic equation for X.
Lastly, we used SMC-FCS, assuming X is marginally normally distributed for all scenarios.We chose to implement SMC-FCS using the same marginal model for X to explore the performance of (substantive model) compatible but mis-specified imputation models.For all imputation approaches, 10 imputed datasets were generated, and estimates and confidence intervals (CI) found using Rubin's rules.We used 10 iterations per imputation in SMC-FCS, and the default 10 iterations in the ice command.With X univariate, SMC-FCS is equivalent to imputation from the corresponding Bayesian model.We used standard non-informative priors for normal linear regression parameters in SMC-FCS, i.e. f (β, σ 2 ) ∝ 1/σ 2 , where β and σ 2 denote the vector of regression coefficients and residual variance respectively.

Results
Table 2 shows the results of the simulations, showing the empirical mean and standard deviation of estimates of β 2 and the coverage of nominal 95% confidence intervals.With normally distributed X and MCAR, linear imputation resulted in biased estimates, with considerable attenuation in β2 towards zero as expected.Here the imputation model being used is incompatible (with the substantive model) and mis-specified.Confidence interval coverage for linear imputation was also extremely poor, with zero coverage for β 2 .JAV, polynomial combination and SMC-FCS gave unbiased results, with similar efficiency to each other.JAV and SMC-FCS had CI coverage close to 95%, but polynomial combination had slightly low coverage.With X log-normally distributed and MCAR linear imputation was again biased with poor CI coverage.JAV was unbiased, although estimates were considerably more variable SMC-FCS.Furthermore, the coverage of the CI for β 2 from JAV was only 84%.The polynomial combination method performed similarly to JAV here.Despite the assumed model for X being mis-specified, SMC-FCS was unbiased and the 95% CI for β 2 had the correct coverage.With X distributed according to a normal mixture model and MCAR, JAV and polynomial combination were again unbiased.The CI coverage of JAV and polynomial combination for β 2 was close to 95%.Linear imputation continued to be severly biased.SMC-FCS was somewhat biased towards the null for β 2 , and consequently CI coverage for β 2 was only 74%. for β 2 , as did linear imputation.Despite using a mis-specified model for the distribution of X, SMC-FCS was again unbiased, although the CI for β 2 had somewhat lower than nominal coverage.Lastly with X distributed as a mixture of two normals and MAR, linear imputation continued to be substantially biased.JAV was biased to a lesser extent, although its CI for β had poor coverage.SMC-FCS was biased (since the assumed distribution for X was incorrect) towards zero, and its CI for β 2 had extremely poor coverage.The polynomial combination method performed best here, with unbiased estimates of β 2 and only somewhat reduced CI coverage.

Linear regression with interaction
Next we considered a linear regression substantive model in which two covariates interact with each other in their effect on outcome.

Simulation setup
The outcome Y was generated according to: , where as before, σ 2 ǫ was chosen to give R 2 = 0.5.In the first scenario X 1 and X 2 were generated from a bivariate normal distribution, with each covariate having mean 2 and variance 1, and the correlation between the two equal to 0.5.To explore robustness of the imputation methods to violations of normality assumptions, in a second scenario log(X 1 ) and log(X 2 ) were generated from a bivariate normal distribution so that they both had marginal distribution N (log( √ 3.2), log 5/4) and the correlation between the two was equal to 0.5.To investigate robustness to linearity assumptions between covariates, in a third scenario we generated X 1 ∼ N (2, 1) and . Fourth, we generated X 1 from a Bernoulli distribution with probability 0.5, and X 2 |X 1 ∼ N (X 1 , 1).To explore robustness to violations of normality assumptions, in the final scenario we generated X 1 from a Bernoulli distribution with probability 0.5 and where v ∼ N (log( √ 3.2), log (5/4)).
Values in both X 1 and X 2 were first made (independently) MCAR, each with probability 0.7 of being observed.
We then repeated the simulations with X 1 observed with probability expit(α 0 + α 1 Y ) where α 1 = −1/SD(Y ) and α 0 was chosen to make the marginal probability of observing X equal to 0.7, and with X 2 also observed with probability expit(α 0 + α 1 Y ).

Estimation methods
For each simulated dataset, as before, estimates were obtained first using CC.In 'FCS' missing values in X 1 and X 2 were imputed using the ice command in Stata.A linear regression imputation model was used when the covariate was continuous and a logistic regression imputation model when the covariate was binary.In the imputation model for X 1 (X 2 ) the outcome Y , X 2 (X 1 ) and their interaction Y X 2 (Y X 1 ) were included as explanatory variables.
We obtained estimates using JAV by including the interaction variable X 1 X 2 as an additional variable in the ice command.Covariate X 1 (X 2 ) was imputed using a linear regression model (even when X 1 was binary) with Y , X 2 (X 1 ) and X 1 X 2 as explanatory variables.The interaction term X 1 X 2 was imputed using a linear regression model with Y , X 1 and X 2 as explanatory variables.Since all conditional imputation models are linear regressions with other variables included linearly, FCS is here equivalent to imputation from a trivariate normal imputation Lastly, we obtained estimates using SMC-FCS, assuming a normal regression model for X 1 |X 2 or a logistic regression model when X 1 was binary.A linear regression model was assumed for X 2 |X 1 .When assuming X 1 |X 2 and X 2 |X 1 are linear regressions, SMC-FCS is equivalent to imputation from the Bayesian model defined by the substantive model and a bivariate normal model for (X 1 , X 2 ).In contrast, when assuming a logistic regression model for X 1 |X 2 and a linear regression for X 2 |X 1 , although these models are compatible, SMC-FCS is not equivalent to imputation from a Bayesian model.When drawing from the posterior of the logistic regression parameters (equation ( 5)) we used a multivariate normal, with mean equal to the MLE and variance covariance corresponding to the inverse of the 'observed' data information matrix.
Table 3: Simulation results -linear regression with interaction.Empirical mean (SD) of estimates of β 1 = 1 and β 3 = 1 from 1,000 simulations, using complete case analysis, standard FCS imputation (FCS), just another variable imputation (JAV), and SMC-FCS.Empirical coverage of nominal 95% confidence intervals is also shown (Cov).Monte-Carlo errors for means and SDs are all less than 0.04 for β 1 and less than 0.02 for β 3 .Monte-Carlo errors for confidence interval coverage are less than 1.6%.

Cox proportional hazards models
Lastly, we performed simulations for imputing missing covariates with a Cox proportional hazards model.Recently [16] derived approximate results to inform the choice of imputation model in this context.They recommended including the event indicator and the Nelson-Aalen estimate of the marginal cumulative hazard function as covariates in imputation models.Their simulation results showed that this approach generally worked well for imputing normally distributed covariates, except when the covariate effects were large, when some attenuation towards the null occurred.

Simulation setup
Survival times were simulated with hazard function h(t|X) = 0.002 exp(β 1 X 1 + β 2 X 2 ) with β 1 = β 2 = 1.Censoring times were generated from an exponential distribution with hazard 0.002.We simulated X 1 from a Bernoulli distribution with probability 0.5, and X 2 |X 1 ∼ N (X 1 , 1).Values in X 1 and X 2 were made (independently) missing completely at random, with probability of observation 0.7.We performed simulations with n = 1, 000 subjects and also with n = 100 subjects.

Estimation methods
For each simulated dataset we first estimated β by fitting the Cox proportional hazards model to the complete cases.Next we multiply imputed the missing values in X 1 and X 2 using FCS (10 imputations).A linear regression imputation model was used for X 2 and a logistic regression model for X 1 .Following the recommendations of [16], we included the event indicator D and the Nelson-Aalen estimate of the marginal cumulative hazard as covariates in both imputation models (FCS).Lastly we estimated β using SMC-FCS as described in Section 6.3, assuming a logistic regression model for X 1 |X 2 and a linear regression model for X 2 |X 1 .As described previously, the SMC-FCS algorithm involves taking draws from the posterior distribution of the parameter ψ in the substantive model.For Cox's proportional hazards model ψ = (β, H 0 (.)), where H 0 (.) is an infinite dimensional parameter representing the arbitrary baseline hazard function.It is unclear how a draw can be made from the posterior distribution of H 0 (.), and indeed whether Rubin's rules can be expected to give asymptotically unbiased variance estimates in a semiparametric model.In our simulation study we allowed for uncertainty in β by drawing a new value from a (bivariate) normal distribution with mean equal to the current estimate of β and with covariance matrix based on the usual 'observed' data information matrix.We then updated H 0 (.) using the usual Breslow estimator, conditioning on the newly drawn value of β.

Results
Table 4 shows the results from the 1,000 simulations.CC is consistent here, since missingness is completely at random.However, with n = 100 CC showed some upward finite sample bias for both β 1 and β 2 .In accordance Table 4: Cox proportional hazards outcome model simulation results.Empirical mean (SD) of estimates of β 1 = 1 and β 2 = 1 from 1,000 simulations, using complete case analysis, multiple imputation of X 1 and X 2 using FCS with the event indicator and Nelson-Aalen marginal baseline cumulative hazard function as covariates (FCS), and SMC-FCS.Empirical coverage of nominal 95% confidence intervals is also shown (Cov with the results of White and Royston, FCS resulted in somewhat biased estimates, with the bias larger for the coefficient corresponding to the continuous covariate, although confidence interval coverage for both β 1 and β 2 was approximately 95%.SMC-FCS, like CC, showed some slight upward bias, but was somewhat more efficient.Of interest was that the confidence intervals had correct coverage, despite the fact that our implementation ignores uncertainty in the baseline hazard function.
For n = 1, 000, CC was essentially unbiased.The biases of FCS were larger than for n = 100, which is due to the fact that the finite sample bias, which acted in the opposite direction to the bias caused by the approximation used in the FCS approach, had largely disappeared.Consequently, confidence interval coverage was below the nominal 95% level, with coverage for β 2 particularly poor at 47%.In contrast, SMC-FCS was unbiased and had correct confidence interval coverage.

Analysis of data from ADNI
Table 5 shows the estimated log hazard ratios from the substantive model fitted to the n = 127 complete cases (of whom 61 converted to AD).This showed borderline evidence of an association between CSF Aβ 1−42 and hazard of conversion, and borderline significant evidence of curvature in the association, in agreement with the findings of [11].The estimated association suggests that increasing Aβ 1−42 is associated with increased hazard of conversion up until a value of ≈ 14 ng/mL, after which hazard decreases.There was evidence that increased levels of Ptau were associated with increased hazard of conversion.Contrary to what we expected, having a mother or father with AD was suggestive of lower hazard of conversion to AD, although neither coefficient was statistically significant.Hippocampal volume was the strongest predictor of hazard (measured by statistical significance), with larger volumes associated with lower hazard of conversion.This is consistent with the findings of previous studies which have found that the hippocampus is one of the earliest structures of the brain to undergo atrophy during AD.
Next we used FCS MI to impute the partially observed baseline variables.50 imputations were used.Continuous course the validity of estimates from different models fitted to a set of multiple imputations depends on whether the imputation model is correctly specified.In practice all imputation models are likely to be mis-specified to some extent, but biases may be small provided imputation models preserve those features of the data which are subsequently investigated.It is therefore unrealistic in practice to expect a single set of multiple imputations to be suitable for all possible subsequent types of analysis.
When a number of putative substantive models for the outcome Y are of interest, the SMC-FCS algorithm could be used to impute the partially observed covariates assuming a general model for f (Y |X, Z, ψ) which contains as special cases the various putative substantive models.This approach would mean the covariate models used would be compatible with this larger model, and semi-compatible with those substantive models nested within it.This advice follows that given by others (e.g.[6,7]) for application of MI in general, whereby imputation models are used which are rich and do not impose assumptions which are subsequently to be relaxed in substantive models.For example, if based on contextual knowledge and preliminary data analysis it is thought that two partially covariates may interact in their effect on Y , one could impute under a model f (Y |X, Z, ψ) which includes the corresponding interaction.
As noted in Section 1 in many settings auxiliary variables V may be available, which although not involved in the substantive model, may be useful for inclusion in imputation models in order to improve efficiency (by virtue of their association with variables being imputed) or to increase the plausibility of the MAR assumption.The notion of compatibility between imputation and substantive models does not then apply, since the two models involve different sets of variables.However,fully observed auxiliary variables Z could be included as additional fully observed covariates (i.e.incorporated as part of Z).

Discussion
Multiple imputation is an attractive approach for handling missingness in covariates of regression models.In many settings standard choices of covariate imputation models are compatible with the substantive model, rendering our proposed approach unnecessary.However for certain substantive models default imputation models in MI software may be incompatible and therefore mis-specified (assuming the substantive model is correctly specified).This is particularly likely to be the case for substantive models which contain non-linear covariate effects or interactions.
Our proposed modification of the popular FCS approach to MI ensures that each covariate is imputed from a model which is compatible with the substantive model.Although compatibility does not guarantee the imputation models are correctly specified, it ensures that the imputation and substantive models do not make conflicting assumptions which may induce bias in parameter estimates.
In special cases SMC-FCS is equivalent to imputation from a Bayesian joint model, and thus inherits the latter's properties.More generally, if the covariate models used in SMC-FCS are mutually compatible and correctly specified we conjecture that the resulting estimator is consistent, which is supported by our simulation results.Further, in these cases confidence interval coverage for estimates from SMC-FCS attained nominal coverage, despite the lack of equivalence to imputation from a Bayesian joint model.In simulations in which the covariate models were mis-specified, estimates from SMC-FCS were still less biased than those from what might be considered 'standard FCS'.
For linear substantive models which contain non-linear covariate effects or interactions, the 'just another variable' (JAV) approach is attractive, and is consistent if data are missing completely at random.This holds irrespective of the joint distribution of the outcome and covariates.However when data are MAR, or for other substantive model types such as logistic regression, we and Seaman et al [10] have shown that JAV gives biased estimates.At least in our limited simulation study, the polynomial combination method recently proposed by van Buuren [21] was superior to JAV, with less bias and coverage closer to the nominal level.A limitation of this approach however is that it is only applies to imputation of covariates which have a quadratic association with outcome, and it is unclear whether it can be generalised to substantive models other than linear regression.
Relative to standard FCS MI, SMC-FCS is more computationally intensive because of the use of rejection sampling to sample from the required densities.For example, the SMC-FCS algorithm took six times longer than standard FCS to create 10 imputations for a simulated dataset from the first simulation scenario (linear regression with quadratic covariate effects).The acceptance rate of the rejection sampler will be low when the target density f (X j |X −j , Z, Y ) differs substantially from the candidate density f (X j |X −j , Z).This will occur if a subject has an outcome value Y which is unlikely to have occurred given the values of X −j and Z. However our experience thus far in simulation studies has been that this has not been an issue.Furthermore, as for standard FCS MI, additional work is needed to understand the statistical properties of the SMC-FCS algorithm.In some settings substantive models may be fitted to imputed datasets for a number of different outcomes, and a limitation of our approach is that imputation models are defined with respect to a single (possibly multivariate) outcome variable.
We note that compatibility between imputation and substantive models is closely related to the concept of congeniality defined by Meng [6].We chose not to adopt this term because Meng's definition of congeniality depends additionally on specification of incomplete and complete data 'procedures' which give asymptotically equivalent inferences to those under a Bayesian model.Further, in many cases (e.g. when a logistic regression model is used to impute a covariate) SMC-FCS is not equivalent to imputation from a joint model, and so would not satisfy Meng's definition of congeniality.Lastly, the setup adopted by Meng assumed that covariates are fully observed.
In this paper we have assumed that the outcome is fully observed.In the absence of auxiliary variables subjects with missing outcome provide little or no additional information regarding the substantive model parameters [22], such that imputation of missing outcomes may not be beneficial.Nevertheless, the SMC-FCS algorithm can be readily extended to impute missing outcome values by imputing from the assumed substantive model.
A Stata program implementing SMC-FCS for linear, logistic, and Cox proportional hazards models of interest is available for free download from www.missingdata.org.uk.The Principal Investigator of this initiative is Michael W. Weiner, MD, VA Medical Center and University of California San Francisco.ADNI is the result of efforts of many co-investigators from a broad range of academic institutions and private corporations, and subjects have been recruited from over 50 sites across the U.S. and Canada.
The initial goal of ADNI was to recruit 800 adults, ages 55 to 90, to participate in the research, approximately 200 cognitively normal older individuals to be followed for 3 years, 400 people with MCI to be followed for 3 years and 200 people with early AD to be followed for 2 years.For up-to-date information, see www.adni-info.org.
Data collection and sharing for this project was funded by the Alzheimer's Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904).ADNI is funded by the National Institute on Let x obs j and x mis j denote the vectors of observed and missing values in X j for the n subjects.Let y and z denote the vector and matrix of (fully observed) values of Y and Z across the n subjects.The FCS algorithm begins by replacing the missing values in each X j by randomly selected observed values from the same variable.The algorithm then proceeds by repeatedly imputing the missing values in each variable, at each stage conditioning on the most recent imputations of the other variables.Let x mis(t) j denote the imputations of the missing values x mis j at iteration t and let x (t) j = (x obs j , x mis(t) j was supported by a grant from the ESRC Follow-On Funding scheme (RES-189-25-0103) and MRC grant G0900724.S. Seaman and I. White were supported through a Medical Research Council grant (MC US A030 0015) and unit programme (U105260558).J. Carpenter was supported by ESRC Research Fellowship RES-063-27-0257.Data used in the preparation of this article were obtained from the Alzheimers Disease Neuroimaging Initiative (ADNI) database (adni.loni.ucla.edu).The ADNI was launched in 2003 by the National Institute on Aging (NIA), the National Institute of Biomedical Imaging and Bioengineering (NIBIB), the Food and Drug Administration (FDA), private pharmaceutical companies and non-profit organizations, as a $60 million, 5-year public-private partnership.The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimers disease (AD).Determination of sensitive and specific markers of very early AD progression is intended to aid researchers and clinicians to develop new treatments and monitor their effectiveness, as well as lessen the time and cost of clinical trials.

Table 1 :
Baseline characteristics of n = 382 ADNI subjects with MCI at baseline

Table 2 :
[10]lation results -linear regression with quadratic covariate effects.Empirical mean (SD) of estimates of quadratic coefficient β 2 = 1 from 1,000 simulations, using linear imputation (linear), just another variable imputation (JAV), the polynomial combination method, and SMC-FCS.Empirical coverage of nominal 95% confidence intervals is also shown (Cov).Monte-Carlo errors for means and SDs are less than 0.003, except for log-normal X MAR, where Monte-Carlo errors for means and SDs are less than 0.02.Monte-Carlo errors for confidence interval coverage are less than 1.6%.With normal X and MAR, linear imputation gave biased estimates and the CI for β 2 had zero coverage.With data MAR, JAV no longer gave unbiased estimates, in agreement with the findings of[10], and the CI for β 2 had only 18% coverage.Polynomial combination had only slight bias, but CI coverage for β 2 was only 75.9%.In contrast, SMC-FCS was unbiased and the CI for β 2 had approximately 95% coverage.All estimators were considerably more variable with X log-normal MAR.JAV and polynomial combination had considerable bias and poor CI coverage ). Monte-Carlo errors in means and SDs are no more than 0.02 for n = 100 and 0.005 for n = 1000.

Table 5 :
Estimates of log hazard ratios (standard errors) for Cox proportional hazards model relating hazard of conversion to AD to baseline risk factors.Estimates based on complete case, FCS imputation, and SMC-FCS.