Missing Data in Clinical Research: A Tutorial on Multiple Imputation

Missing data are a common occurrence in clinical research. Missing data occur when the values of the variables of interest are not measured or recorded for all subjects in the sample. Common approaches to addressing the presence of missing data include complete-case analyses, in which subjects with missing data are excluded, and mean-value imputation, in which missing values are replaced with the mean value of that variable among the subjects for whom it is not missing. However, in many settings, these approaches can lead to biased estimates of statistics (e.g., of regression coefficients) and/or to confidence intervals that are artificially narrow. Multiple imputation (MI) is a popular approach for addressing the presence of missing data. With MI, multiple plausible values of a given variable are imputed, or filled in, for each subject who has missing data for that variable, resulting in the creation of multiple completed datasets. Identical statistical analyses are conducted in each of these completed datasets and the results are pooled across them. We provide an introduction to MI and discuss issues in its implementation, including developing the imputation model, deciding how many imputed datasets to create, and addressing derived variables. We illustrate the application of MI through an analysis of data on patients hospitalized with heart failure, focusing on developing a model to estimate the probability of one-year mortality in the presence of missing data. Statistical software code for conducting multiple imputation in R, SAS, and Stata is provided.


Brief summary
Missing data are a common occurrence in clinical research. We briefly discuss common approaches to addressing missing data and highlight their limitations. We introduce multiple imputation (MI), a popular approach for addressing the presence of missing data. With MI, multiple plausible values of a given variable are imputed, or filled in, for each subject who has missing data for that variable. We describe the steps to follow when conducting an analysis using MI.

Introduction
Missing data are a common occurrence in clinical research. Missing data occur when the values of the variables of interest are not measured or recorded for all subjects in the sample.
Data can be missing for several reasons, including: (i) patient refusal to respond to specific questions (e.g., patient does not report data on income); (ii) loss of patient to follow-up; (iii) investigator or mechanical error (e.g., sphygmomanometer failure); (iv) physicians not ordering certain investigations for some patients (e.g., cholesterol test not ordered for some patients).
Before discussing different ways of addressing the presence of missing data, it is important to understand the conditions under which data are subject to being missing. Rubin developed a framework for addressing missing data and described three different missing-data mechanisms 1,2 . Data are said to be 'missing completely at random' (MCAR) if the probability of a variable being missing for a given subject is independent of both the observed and unobserved variables for that subject 3 (a list of abbreviations is provided in Table 1). If data are MCAR, then the subsample consisting of subjects with complete (or non-missing) data is a representative subsample of the overall sample. An example of MCAR is laboratory values that are missing because the sample was lost or damaged in the laboratory; the occurrence of such events in the laboratory is unlikely to be related to characteristics of the subject. Data are said to be 'missing at random' (MAR) if, after accounting for all the observed variables, the probability of a variable being missing is independent of the unobserved data. If physicians were less likely to order laboratory tests for older patients and that was the only factor influencing whether or not a test was ordered and recorded, then missing laboratory data would be MAR (assuming that age was recorded for all patients). Finally, data are said to be 'missing not at random' (MNAR) if they are neither MAR nor MCAR. Thus, data are MNAR if the probability of a variable being missing, even after accounting for all the observed variables, is dependent on the value of the missing variable itself. An example of data that are MNAR could be income, in which more affluent subjects, even after accounting for other characteristics, are less likely to report their income in surveys than are less affluent subjects. Unfortunately, one cannot test whether the data are MAR vs. MNAR, so one must judge what is plausible using clinical knowledge 4,5 .
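To make the three mechanisms concrete, the following sketch simulates each one in turn. The variables, probabilities, and effect sizes are hypothetical; the point is only that the complete-case mean of a lab value is essentially unbiased under MCAR but biased under MAR and MNAR.

```python
import random
from statistics import mean

random.seed(1)

# Simulate a cohort with age (always observed) and a lab value that rises
# with age. All names and numbers here are illustrative.
n = 10_000
cohort = []
for _ in range(n):
    age = random.uniform(40, 90)
    lab = 0.05 * age + random.gauss(0, 1)
    cohort.append((age, lab))

def observed_labs(mechanism):
    """Lab values that remain observed under a given missing-data mechanism."""
    kept = []
    for age, lab in cohort:
        if mechanism == "MCAR":       # unrelated to any variable
            p_missing = 0.30
        elif mechanism == "MAR":      # depends only on observed age
            p_missing = 0.60 if age > 75 else 0.10
        else:                         # MNAR: depends on the missing value itself
            p_missing = 0.60 if lab > 4 else 0.10
        if random.random() >= p_missing:
            kept.append(lab)
    return kept

full_mean = mean(lab for _, lab in cohort)
bias = {m: mean(observed_labs(m)) - full_mean for m in ("MCAR", "MAR", "MNAR")}
for m, b in bias.items():
    print(m, round(b, 3))
```

Note that under MAR the bias can be removed by conditioning on age (the observed cause of missingness), whereas under MNAR no observed variable can fully account for it.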
Historically, a popular approach when faced with missing data was to exclude all subjects with missing data on any necessary variables and to conduct subsequent statistical analyses using only those subjects who have complete data (accordingly, this approach is often referred to as a 'complete case' analysis). When only the outcome variable is incomplete, this approach is valid under MAR and often appropriate 6 . In general, however, a complete case analysis can produce biased estimates when the data are not MCAR, and it discards information from subjects with partially observed data.

The current paper provides an introduction to MI and illustrates its application using a cardiovascular example. The paper is structured as follows. In Section 2 we introduce MI and discuss several issues related to its implementation. In Section 3 we illustrate its application using an example of logistic regression to model mortality in patients with heart failure. Finally, in Section 4 we summarize our brief tutorial and direct the interested reader to more detailed and comprehensive discussions of MI.

Multiple imputation for missing data
In this section we provide an introduction to MI and discuss issues related to its use.

Multiple imputation using Multivariate Imputation by Chained Equations (MICE)
Fully conditional specification (FCS) is a strategy for specifying multivariate models through conditional distributions. A specific implementation of this strategy, in which every variable is imputed conditional on all of the other variables, is now known as the Multivariate Imputation by Chained Equations (MICE) algorithm 10-13 . In our description of the algorithm we assume that there are p variables, of which k are subject to missing data and p−k are complete. The steps of the algorithm are summarized below 11 .
Predictive mean matching (PMM)

For a given subject with missing data on the variable in question, PMM identifies those subjects with no missing data on the variable in question whose linear predictors (created using the regression coefficients from the fitted imputation model) are close to the linear predictor of the given subject (created using the regression coefficients sampled from the appropriate posterior distribution, as described above). Of those subjects who are close, one subject is selected at random and the observed value of the given variable for that randomly selected subject is used as the imputed value of the variable for the subject with missing data. Morris et al. found that identifying the ten closest subjects without missing data performed well 14 . Using the terminology of Morris et al., we refer to the method described in Section 2.2 as parametric imputation, as the imputed values are drawn from a parametric distribution 14 . This is in contrast to PMM, where the imputed values are drawn from the observed empirical distribution.
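A minimal sketch of the PMM donor-selection step follows. The data, coefficients, and variable names are hypothetical, and only a single predictor is used; real implementations handle many predictors and draw the perturbed coefficients from the posterior distribution as described above.

```python
import random

random.seed(2)

def pmm_impute(y_obs, x_obs, x_mis, beta, beta_star, k=10):
    """Predictive mean matching with one predictor (illustrative sketch).

    y_obs, x_obs : subjects for whom the variable y is observed (the donors).
    x_mis        : predictor values for subjects with y missing.
    beta         : (intercept, slope) estimated from the observed data.
    beta_star    : (intercept, slope) drawn from the posterior distribution.
    """
    # Donors' linear predictors use the point estimates ...
    donor_pred = [beta[0] + beta[1] * x for x in x_obs]
    imputed = []
    for x in x_mis:
        # ... while each recipient's linear predictor uses the perturbed draw.
        target = beta_star[0] + beta_star[1] * x
        # Identify the k donors whose predictions are closest to the target,
        nearest = sorted(range(len(y_obs)),
                         key=lambda i: abs(donor_pred[i] - target))[:k]
        # then copy the *observed* value of one donor chosen at random.
        imputed.append(y_obs[random.choice(nearest)])
    return imputed

# Toy example: hypothetical systolic blood pressure regressed on age.
x_obs = [random.uniform(40, 90) for _ in range(200)]
y_obs = [100 + 0.5 * x + random.gauss(0, 5) for x in x_obs]
imputed = pmm_impute(y_obs, x_obs, x_mis=[50.0, 70.0],
                     beta=(100.0, 0.5), beta_star=(101.0, 0.49))
print(imputed)
```

Because each imputed value is copied from a donor, every imputation is a value that was actually observed, which is why PMM cannot produce impossible values.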

Analyses in the M imputed datasets
Once the M imputed datasets have been created, the statistical analysis of interest (the analysis model) is conducted separately in each of the M datasets, exactly as it would have been conducted had the data been complete. This yields M estimates of each statistic of interest (e.g., M estimates of each regression coefficient and of its standard error).

Rubin's Rules for combining estimates and standard errors across imputed datasets
Once the analysis has been conducted in each of the M imputed datasets, the M estimates of a given statistic θ are combined using Rubin's Rules. The pooled estimate is the average of the M individual estimates: θ̄ = (1/M) Σ θ_m. The variance of the pooled estimate is T = W + (1 + 1/M)B, where W = (1/M) Σ SE_m² is the average within-imputation variance (the average of the squared standard errors) and B = (1/(M−1)) Σ (θ_m − θ̄)² is the between-imputation variance. This quantity reflects both the average within-imputation variation in θ as well as the between-imputation variation in θ. Note that when using single imputation, there is no estimate of B, so we are unable to estimate the true variation in the statistic.
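The pooling step can be written in a few lines. The estimates and standard errors below are hypothetical values for a single coefficient from M = 5 imputed datasets.

```python
from math import sqrt
from statistics import mean

def pool_rubin(estimates, std_errors):
    """Combine M estimates of one statistic and their standard errors
    using Rubin's Rules."""
    M = len(estimates)
    theta_bar = mean(estimates)                    # pooled point estimate
    W = mean(se ** 2 for se in std_errors)         # average within-imputation variance
    B = sum((t - theta_bar) ** 2 for t in estimates) / (M - 1)  # between-imputation variance
    T = W + (1 + 1 / M) * B                        # total variance
    return theta_bar, sqrt(T)

# Hypothetical log-odds ratios for one covariate from M = 5 imputed datasets.
theta, se = pool_rubin([0.52, 0.48, 0.55, 0.50, 0.47],
                       [0.10, 0.11, 0.10, 0.12, 0.10])
print(round(theta, 3), round(se, 3))
```

The pooled standard error incorporates B and is therefore larger than a naive average of the within-imputation standard errors, which is exactly the extra uncertainty that single imputation fails to capture.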

How many imputations: how large should M be?
An important question is how many imputed datasets should be created. Early recommendations were that three to five imputed datasets were sufficient as long as the amount of missing information was not very high 1,3 , while others suggested that 5-10 imputations were often sufficient 7 . These early recommendations were based on the accuracy with which the regression coefficient was estimated compared to its accuracy had it been estimated with an infinite number of imputed datasets. However, analysts are interested not only in estimated regression coefficients (e.g., log-odds ratios or log-hazard ratios), but also in their associated standard errors (which are used in deriving confidence intervals and significance tests). Thus, one wants to estimate not only regression coefficients accurately, but also standard errors. A widely cited rule of thumb is that the number of imputed datasets should be at least as large as the percentage of subjects with any missing data 11 . Following this rule will result in estimates of regression coefficients, test statistics (regression coefficients divided by their standard errors) and p-values with minor variability across repeated MI analyses (i.e., the Monte Carlo error will be low). A more advanced method for determining the number of imputations was developed by Von Hippel 15 . Nowadays computation is cheap and the use of between 20 and 100 imputed datasets is common.
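The rule of thumb above is simple enough to state as a one-line helper. The floor of 20 imputations is our illustrative choice, reflecting the current practice of 20-100 imputations rather than any published cutoff.

```python
def suggested_imputations(pct_any_missing, floor=20):
    """Rule of thumb from the text: use at least as many imputed datasets as
    the percentage of subjects with any missing data. The floor of 20 is an
    arbitrary minimum chosen for illustration, reflecting current practice
    of using 20-100 imputations."""
    return max(floor, round(pct_any_missing))

print(suggested_imputations(78))  # 78% of subjects had missing data in the case study
print(suggested_imputations(5))   # low missingness still gets the illustrative floor
```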

Which variables to include in the imputation model?
Investigators need to distinguish between two different statistical models: the imputation model and the analysis model. The imputation model is used for imputing missing data; it is not of direct interest and serves only to provide reasonable imputations. The analysis model contains the quantities that are ultimately of scientific interest and is the focus of the research question.
The rules for building imputation and analysis models are very different. It is important to include in the imputation model all the variables that will be included in the analysis model. It is especially important to include in the imputation model the outcome variable for the analysis model 5,11 . Failure to do so usually results in estimated regression coefficients for the analysis model being biased toward the null. When the outcome in the analysis model is a survival or time-to-event outcome (e.g., the outcome model is a Cox proportional hazards model) then there are two components to the outcome: a time-to-event variable denoting the time to the occurrence of the event or the time to censoring, and a binary indicator variable denoting whether the subject experienced the event or was censored. The recommended approach is to include both in the imputation model, with the time-to-event variable transformed using the cumulative survivor function 16 . In addition, the imputation model is improved by including variables that are related to the missingness and variables that are correlated with variables of interest. In longitudinal data, when imputing a variable for a specific measurement occasion (e.g., on the second clinic visit), one also needs to include in the imputation model future values of that variable (e.g. the value of that variable at the third clinic visit).
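For the survival-outcome recommendation, the transformed time variable must first be constructed. The sketch below uses a simple Kaplan-Meier estimate of the survivor function evaluated at each subject's own follow-up time; this is our illustrative choice, and reference 16 describes the recommended transformation in detail. The resulting column, together with the event indicator, would then be entered as predictors in the imputation model.

```python
def km_survivor(times, events):
    """Kaplan-Meier estimate of the survivor function, evaluated at each
    subject's own follow-up time. A sketch only: ties are handled naively
    (subjects are processed one at a time in order of follow-up time)."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk = len(times)
    s = 1.0
    surv = [0.0] * len(times)
    for i in order:
        if events[i]:                 # an event reduces the survivor function
            s *= (at_risk - 1) / at_risk
        surv[i] = s                   # censoring leaves it unchanged
        at_risk -= 1
    return surv

# Toy data: follow-up times in years and event indicators (1 = died).
times = [2.0, 1.0, 3.0]
events = [1, 1, 0]
surv = km_survivor(times, events)
print([round(s, 3) for s in surv])
```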

Imputing derived variables
The analysis model may include variables that are derived from other variables. Examples include body mass index (BMI, which is derived from height and weight), quadratic terms for continuous variables (e.g., age²), and interactions between variables (i.e., products of variables). Two strategies are commonly described: derive the variable from its imputed components after imputation (sometimes called passive imputation), or treat the derived variable as 'just another variable' and impute it directly 18,19 . Van Buuren (Sec 6.4) describes some alternative strategies for specific types of dependencies 10 . Since no strategy performs uniformly better, some tailoring to the type of derived variable may be needed.
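A minimal worked example of the trade-off for BMI, with illustrative (hypothetical) values: deriving the variable after imputation keeps it consistent with its components, whereas imputing it directly generally does not.

```python
# Illustrative values only: height in metres (observed) and weight in kg
# (imputed for this subject).
height, weight_imputed = 1.70, 82.0

# Strategy 1 (derive after imputation): compute BMI from the imputed
# components, so the derived value is always internally consistent.
bmi_derived = weight_imputed / height ** 2

# Strategy 2 ('just another variable'): impute BMI directly alongside
# height and weight. This can serve the analysis model well, but nothing
# forces the imputed BMI to equal weight / height**2.
bmi_direct = 29.1   # hypothetical directly imputed value

print(round(bmi_derived, 1), bmi_direct)
consistent = abs(bmi_direct - weight_imputed / height ** 2) < 1e-6
print(consistent)
```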

Missing outcome variables
Multiple imputation is blind to which variables are outcomes and which variables are covariates in the analysis model. Consequently, missing values of the outcome variable can be imputed in the same manner as missing values of the covariates.

Case study
We use data on patients hospitalized with heart failure in the province of Ontario to provide a case study illustrating the application of MI. The analysis model of interest is a logistic regression model in which death within one year of hospital admission is regressed on ten patient characteristics.

Data sources
We used previously described data on patients hospitalized with heart failure 22-24 . Our purpose in using these data was to illustrate the application of statistical methods and not to draw clinical conclusions. Accurate estimation of the association of variables with cardiovascular outcomes in current patients may require the use of more recent data and of a more comprehensive set of predictor variables. Furthermore, depending on the objective of the intended study, a different regression model may be more appropriate.

Descriptive statistics
Means and percentages are reported for the continuous and binary variables, respectively, in Table 3. We also report the percentage of subjects with missing data for each of the variables.
The percentage of missing data ranged from a low of 0% (age and sex) to a high of 73% (LDL cholesterol). Overall, 78% of subjects had missing data on at least one variable.

Comparison of subjects with and without missing data
We conducted univariate comparisons of those with and without missing data. There are at least two reasons for these comparisons. First, as noted above, the imputation model is improved by including variables that are related to the missingness. Thus, these comparisons will help identify variables that should be included in the imputation model. Second, these analyses provide evidence as to the plausibility of the MAR assumption. If those with and without missing data differ on many observed variables, then it is plausible that they may also differ on unobserved variables. Note that a lack of significant univariate associations does not provide proof that the data are MCAR or MAR.
There were meaningful differences in age, sex and mortality (the three variables that were not subject to missingness) between those with complete data and those with missing data. The average age of those with complete data was 73.7 years while it was 77.5 years for those with missing data. Of those with complete data, 43.4% were female, while 53.0% of those with missing data were female. Of those with complete data, 23.7% died within one year of admission, while 33.9% of those with missing data died within one year of admission. In summary, patients with missing data tended to be older, were more likely to be female, and were more likely to die than those with complete data.

Complete case analysis
We conducted a complete case analysis that was restricted to the 1,806 subjects with complete data. The reason for doing this is that complete case analysis is less prone to user error than MI (as it does not rely on an imputation model), and we should be able to explain any differences between the complete case analysis and the MI analysis 5 . We used logistic regression to regress death within one year of hospital admission on the 10 baseline covariates.
The logarithm of the estimated odds ratios and associated 95% confidence intervals are reported in Figure 1 (log-odds ratios are reported so that the confidence intervals are symmetric).
Increasing age and urea were associated with an increased odds of death within one year and had 95% confidence intervals that excluded the null value. None of the binary variables had odds ratios whose associated 95% confidence interval excluded the null value. Note that the odds ratios for the five continuous variables are not directly comparable with one another as they are measured on different scales.

Multiple imputation
Imputation was conducted using the MICE algorithm using PROC MI in SAS (SAS/STAT version 14.1). Logistic regression models were used as the imputation models for the binary variables, while linear regression models were used as the imputation models for the continuous variables.

Descriptive statistics in the imputed datasets
Non-parametric density plots were used to describe the distribution of the four continuous variables that were subject to missing data in the complete cases and in those subjects who were missing data for the given continuous variable. The latter was done separately in each of the imputed datasets. These plots are displayed in Figure 2 (parametric imputation) and Figure 3 (predictive mean matching). The density function in the complete cases is depicted using a solid black line, while the density function of the imputed variable in each of the imputed datasets is depicted using a dashed red line. When using parametric imputation, the distribution of imputed respiratory rate, glucose, and urea failed to display the skewness seen in subjects for whom the variable was observed. However, the distribution of imputed values of LDL was comparable to the empirical distribution in subjects for whom LDL was measured. When using predictive mean matching, the distribution of the imputed values tended to be very similar to that of the observed values of the variable.

Logistic regression in the imputed datasets
In each imputed dataset, we regressed the binary outcome denoting death within one year of hospital admission on the 10 covariates described in Table 3. The regression coefficients and their standard errors were pooled using Rubin's Rules. The estimate of the Monte Carlo error for the ten estimated regression coefficients ranged from 0.000042 for age to 0.005502 for LDL cholesterol. Thus, if we repeated the entire imputation process multiple times, we would expect to see only minor variation in the estimated regression coefficients.
The log-odds ratios and their associated 95% confidence intervals obtained using parametric imputation are reported in Figure 1. Three continuous variables (age, respiratory rate, and urea) had a positive association with 1-year mortality, while females had a lower risk of death than males. The odds ratios and associated 95% confidence intervals obtained using predictive mean matching imputation are also reported in Figure 1. The estimated odds ratios and associated confidence intervals obtained using predictive mean matching imputation were essentially identical to those obtained using parametric imputation. In comparing the results of the three regression analyses, one observes that the confidence intervals obtained from the imputation-based analyses were narrower than those obtained in the complete case analysis. For some variables (e.g., age, S3 and S4), the confidence intervals obtained using the complete case analysis were substantially wider than those obtained using multiple imputation.

Discussion
Missing data occur frequently in clinical research. MI is a statistical tool that allows the researcher to replace missing values with multiple plausible values of the variable in question.
The use of MI allows the researcher to analyze complete datasets while incorporating the uncertainty in the imputed values of the variable. We provided a brief introduction to MI and provided guidance regarding its implementation. We illustrated the application of MI through the analysis of data on patients hospitalized with heart failure.

When applying MI, researchers should explore differences between the observed and imputed distributions and between the complete case analyses and the MI analyses. We refer readers to previously-published guidelines for reporting analyses affected by missing data 5,25 .
The current introduction to MI was not intended to be exhaustive. We refer the interested reader to several excellent texts on MI 1-3,10 as well as to more detailed overview articles 7,11 . We have focused our attention on multiple imputation in observational studies in which clustering of subjects or a multilevel structure is absent. Other works describe methods for using multiple imputation with multilevel data 10,26-29 . Similarly, we have focused on the use of parametric models (e.g., logistic regression models or linear regression models) for the imputation models.
An area of current research is on the use of machine learning methods for multiple imputation 30 .
We have focused on the use of MI when data are either MCAR or MAR. The described methods must be modified if it is thought that the data are MNAR. Van Buuren summarizes different methods to address data that are MNAR 10 . The simplest approach is to assume that the distribution of a variable in those with missing data is shifted compared to the distribution in those with complete data. Sensitivity analyses can then be conducted in which the magnitude of the shift parameter is allowed to vary.
We have focused on the MICE algorithm for multiple imputation, along with a modification, predictive mean matching. This is not the only method to impute missing data. An earlier method, often described as 'joint modeling', specifies a single joint multivariate distribution for the variables subject to missing data 10 . Given the flexibility of the MICE algorithm and its ability to explicitly incorporate different types of variables, its use may be attractive to researchers in biomedical research.
In our case study, we obtained similar parameter estimates when using parametric imputation as when using predictive mean matching imputation. This is to be expected for estimates that depend on the middle of the distribution, such as means or regression coefficients.
In practice, it may be difficult to provide examples where PMM imputation beats a well-crafted parametric imputation model. Nevertheless, analysts often prefer PMM imputation because it preserves typical features of the raw data: it accounts for the discreteness of the data, avoids impossible values, preserves the location of quantiles, and is highly robust to misspecification of the imputation model. All of this comes at no additional cost to the analyst. If the complete-data model depends on such features, then the inference will also be better when using PMM imputation.
In this tutorial article we have focused on the use of multiple imputation in observational studies. In randomized controlled trials (RCTs), multiple imputation is not always the optimal approach 6 . When a univariate outcome is MAR, a complete case analysis using an adjusted analysis is unbiased and efficient 6 . With a multivariate outcome (e.g., an outcome measured at multiple occasions), other approaches may be preferable 6 .

The MICE algorithm
1. Specify an imputation model for each variable that is subject to missing data.
2. Initially fill in the missing values of each incomplete variable with values drawn at random from the observed values of that variable.
3. For the first variable that is subject to missing data: a. Using those subjects for whom the variable is observed, fit the imputation model by regressing the variable on the other variables. b. Obtain the estimated regression coefficients and a measure of their uncertainty (their variance-covariance matrix). c. Using the quantities obtained in (b), randomly perturb the estimated regression coefficients in a way that reflects the degree of uncertainty arising from the data. d. Using the set of perturbed regression coefficients obtained in (c), the conditional distribution of the first variable is determined for each subject with missing data on that variable. e. A value of the variable is drawn from this conditional distribution for each subject with missing data on the first variable.
4. Repeat Step 3 for each of the variables that is subject to missing data. Step 3 and Step 4 form one cycle of the imputation process for creating one imputed dataset.
5. Repeat Steps 3 and 4 the desired number of times (suggested values: 5 to 20 cycles). The final imputed values are used as the imputed values in the first imputed dataset.
6. Repeat Steps 2 to 5 M times to produce M imputed datasets (the choice of M, the number of imputed datasets, is discussed in Section 2.5).

Figures 2 and 3 caption: The solid black line denotes the distribution of the given continuous variable in those subjects for whom that variable was not missing. The red lines denote the distribution of the imputed value for that variable in those subjects for whom the variable was missing. There is one red line for each of the imputed datasets.
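The cycle of the MICE algorithm can be sketched end to end for the simple case of one incomplete continuous variable. The data are hypothetical, and the coefficient perturbation is a rough stand-in for a proper draw from the posterior distribution.

```python
import random
from statistics import mean

random.seed(3)

def simple_ols(x, y):
    """Slope, intercept, residual SD and Sxx for y ~ x (single predictor)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sigma = (sum(r * r for r in resid) / (len(x) - 2)) ** 0.5
    return a, b, sigma, sxx

def impute_once(x, y, missing_idx, cycles=10):
    """Create one imputed dataset for y (x is complete). With a single
    incomplete variable each cycle is independent; cycling matters when
    several incomplete variables feed into each other's imputation models."""
    y = y[:]
    obs = [i for i in range(len(y)) if i not in missing_idx]
    # Step 2: initialise missing entries with random draws from observed values.
    for i in missing_idx:
        y[i] = y[random.choice(obs)]
    for _ in range(cycles):
        # Steps 3a-b: fit the imputation model on subjects with observed y.
        a, b, sigma, sxx = simple_ols([x[i] for i in obs], [y[i] for i in obs])
        # Step 3c: perturb the coefficients to reflect estimation uncertainty
        # (a rough stand-in for sampling from the posterior distribution).
        b_star = b + random.gauss(0, sigma / sxx ** 0.5)
        a_star = a + random.gauss(0, sigma / len(obs) ** 0.5)
        # Steps 3d-e: draw each missing value from its conditional distribution.
        for i in missing_idx:
            y[i] = a_star + b_star * x[i] + random.gauss(0, sigma)
    return y

# Hypothetical data: y depends linearly on x; 100 of 500 y-values are MCAR.
n = 500
x = [random.uniform(0, 10) for _ in range(n)]
y = [2 + 3 * xi + random.gauss(0, 1) for xi in x]
missing = set(random.sample(range(n), 100))
M = 5
completed = [impute_once(x, y, missing) for _ in range(M)]

i0 = min(missing)
pooled_first = mean(c[i0] for c in completed)
print(round(pooled_first, 2), round(2 + 3 * x[i0], 2))
```

Because each imputation draws from the conditional distribution rather than plugging in a single predicted value, the M completed datasets differ from one another, which is what allows the between-imputation variance B to be estimated.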