Multiple Imputation Using SAS Software

Multiple imputation provides a useful strategy for dealing with data sets that have missing values. Instead of filling in a single value for each missing value, a multiple imputation procedure replaces each missing value with a set of plausible values that represent the uncertainty about the right value to impute. These multiply imputed data sets are then analyzed by using standard procedures for complete data, and the results from these analyses are combined. No matter which complete-data analysis is used, the process of combining parameter estimates and their associated standard errors from different imputed data sets is essentially the same. This process results in valid statistical inferences that properly reflect the uncertainty due to missing values. This paper reviews methods for analyzing missing data and applications of multiple imputation techniques. It presents the SAS/STAT MI and MIANALYZE procedures, which perform inference by multiple imputation under numerous settings. PROC MI implements popular methods for creating imputations under monotone and nonmonotone (arbitrary) patterns of missing data, and PROC MIANALYZE analyzes results from multiply imputed data sets.


Introduction
Most SAS statistical procedures exclude from the analysis any observation that has a missing value for any variable. Although using only complete cases is simple, the information in the incomplete cases is lost. Excluding observations with missing values also ignores possible systematic differences between the complete cases and the incomplete cases, and the resulting inference might not apply to the population of all cases, especially when the number of complete cases is small.
There are several approaches to handling missing data. The first approach uses all available data for each quantity being estimated, ignoring missing values only where they occur. For example, the CORR procedure estimates a variable mean by using all cases with nonmissing values for that variable, ignoring possible missing values in other variables. The CORR procedure likewise estimates a correlation by using all cases with nonmissing values for that pair of variables. This estimation makes better use of the available data than complete-case analysis, but the resulting correlation matrix might not be positive definite.
Another approach is single imputation, in which a value is substituted for each missing value. Standard statistical procedures for complete-data analysis can then be used with the filled-in data set. For example, each missing value can be imputed with the variable mean of the complete cases. This approach treats missing values as if they were known in the complete-data analyses. Single imputation does not reflect the uncertainty about the predictions of the unknown missing values, and the resulting estimated variances of the parameter estimates are biased toward zero.
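The variance bias of mean imputation can be seen in a minimal Python sketch (illustrative only; the paper itself works in SAS, and the simulated data are made up): filling every missing value with a single constant adds no spread, so the variance of the filled-in data understates the true variance.

```python
import random
import statistics

random.seed(1)

# Simulate a complete variable, then delete about 30% of the values
# completely at random.
complete = [random.gauss(100, 15) for _ in range(10_000)]
observed = [x for x in complete if random.random() > 0.3]

# Single (mean) imputation: fill every missing value with the observed mean.
fill = statistics.fmean(observed)
n_missing = len(complete) - len(observed)
imputed = observed + [fill] * n_missing

# The constant fill sits exactly at the mean, so the imputed sample's
# variance is biased toward zero relative to the true variance.
var_true = statistics.pvariance(complete)
var_imp = statistics.pvariance(imputed)
print(round(var_true, 1), round(var_imp, 1))  # imputed variance is markedly smaller
```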
Instead of filling in a single value for each missing value, a multiple imputation procedure replaces each missing value with a set of plausible values that represent the uncertainty about the right value to impute (Rubin 1987). The multiply imputed data sets are then analyzed by using standard procedures for complete data and combining the results from these analyses. No matter which complete-data analysis is used, the process of combining results from different data sets is essentially the same.
Multiple imputation does not attempt to estimate each missing value through simulated values, but rather to represent a random sample of the missing values. This process results in valid statistical inferences that properly reflect the uncertainty due to missing values; for example, valid confidence intervals for parameters.
Multiple imputation inference involves three distinct phases:
1. The missing data are filled in m times to generate m complete data sets.
2. The m complete data sets are analyzed by using standard procedures.
3. The results from the m complete data sets are combined for the inference.
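These three phases can be sketched end to end with a toy Python example (an illustration, not the PROC MI algorithm; the normal imputation model fitted to the observed values is an assumption made for this toy data):

```python
import random
import statistics

random.seed(7)
m = 5  # number of imputations

# Toy incomplete data: None marks a missing value.
data = [4.1, None, 5.6, 3.9, None, 4.8, 5.2, None, 4.4, 5.0]
obs = [x for x in data if x is not None]
mu, sd = statistics.fmean(obs), statistics.stdev(obs)

estimates = []
for _ in range(m):
    # Phase 1: fill in the missing values with plausible random draws.
    completed = [x if x is not None else random.gauss(mu, sd) for x in data]
    # Phase 2: analyze the completed data set with a standard estimator.
    estimates.append(statistics.fmean(completed))

# Phase 3: combine -- the MI point estimate is the average of the m estimates.
q_bar = statistics.fmean(estimates)
print(round(q_bar, 2))
```

The m completed data sets differ only in their imputed values, which is what lets the between-imputation spread of the m estimates reflect the uncertainty due to missingness.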
The MI procedure in SAS/STAT software is a multiple imputation procedure that creates multiply imputed data sets for incomplete p-dimensional multivariate data. It uses methods that incorporate appropriate variability across the m imputations. After the m complete data sets are analyzed by using standard procedures, the MIANALYZE procedure can then be used to generate valid statistical inferences about these parameters by combining results from the m complete data sets.

Multiple imputation methods in the MI procedure
This section describes the methods that are available in PROC MI. PROC MI assumes that the missing data are missing at random (MAR); that is, the probability that an observation is missing might depend on the observed values Y_obs, but not on the missing values Y_mis (Rubin 1976, 1987). Furthermore, PROC MI also assumes that the parameters θ of the data model and the parameters φ of the missing-data indicators are distinct. That is, knowing the values of θ does not provide any additional information about φ, and vice versa. If both the MAR and distinctness assumptions are satisfied, the missing-data mechanism is said to be ignorable. The imputation method of choice depends on the pattern of missingness in the data and the type of the imputed variable. A data set with variables Y_1, Y_2, ..., Y_p (in that order) is said to have a monotone missing pattern when the event that a variable Y_j is missing for a particular individual implies that all subsequent variables Y_k, k > j, are missing for that individual. Table 1 summarizes the available methods.
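The monotone-pattern definition translates directly into a small check: in every row (with variables in the given order), once a value is missing, all later values must also be missing. The helper below is a hypothetical Python illustration, not part of SAS.

```python
def is_monotone(rows):
    """Return True if the rows have a monotone missing pattern: in every row,
    a missing value (None) is followed only by missing values."""
    for row in rows:
        seen_missing = False
        for value in row:
            if value is None:
                seen_missing = True
            elif seen_missing:
                return False  # an observed value appears after a missing one
    return True

# Monotone: once Y_j is missing, all subsequent variables are missing too.
print(is_monotone([(1, 2, 3), (1, 2, None), (1, None, None)]))  # True
# Arbitrary: the second row has Y2 missing but Y3 observed.
print(is_monotone([(1, 2, 3), (1, None, 3)]))                   # False
```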
For data sets with monotone missing patterns, the variables with missing values can be imputed sequentially with covariates constructed from their corresponding sets of preceding variables. To impute missing values for a continuous variable, one of the following methods can be used: a regression method (Rubin 1987), a predictive mean matching method (Heitjan and Little 1991; Schenker and Taylor 1996), or a propensity score method (Lavori, Dawson, and Shera 1995). To impute missing values for a classification variable, one of the following methods can be used: a logistic regression method when the classification variable has a binary or ordinal response, or a discriminant function method when the classification variable has a binary or nominal response.
For data sets with arbitrary missing patterns, a Markov chain Monte Carlo (MCMC) method that assumes multivariate normality can be used to impute missing values (Schafer 1997). The MCMC method can be used to impute either all the missing values or just enough missing values to make the imputed data sets have monotone missing patterns. A monotone missing data pattern offers greater flexibility in the choice of imputation models (such as the monotone regression method) that do not use Markov chains. A different set of covariates can also be specified for each imputed variable.
For data sets with arbitrary missing patterns, a fully conditional specification (FCS) method can also be used to impute missing values for both continuous and classification variables (Brand 1999; van Buuren 2007). The FCS method assumes the existence of a joint distribution for all variables. The method does not start with an explicitly specified multivariate distribution for all variables, but rather uses a separate conditional distribution for each imputed variable. This feature is not described further in this paper, but is described in the documentation of the MI procedure for SAS/STAT 9.3 (SAS Institute Inc. 2011b).

Methods for data sets with monotone missing data patterns
For a data set with a monotone missing data pattern, one of the following methods can be used: a regression method, a predictive mean matching method, or a propensity score method to impute missing values for a continuous variable; a logistic regression method for a classification variable with a binary or ordinal response; or a discriminant function method for a classification variable with a binary or nominal response.
For a variable with missing values, a model is fitted using observations with observed values for the variable. From this fitted model, new model parameters are drawn and are used to impute missing values for the variable. The missing values are imputed sequentially for variables in the order given by the VAR statement. That is, for a variable Y_j with missing values, the missing values are imputed from the conditional distribution of Y_j given covariates X_1, X_2, ..., X_k, where the covariates are constructed from the preceding variables Y_1, Y_2, ..., Y_{j-1}. An example is a regression model for Y_j with covariates X_1, X_2, ..., X_k.
The following steps are used to impute missing values for Y_j at each imputation:

1. The regression model is fitted using observed values for the variable Y_j and its covariates X_1, X_2, ..., X_k. The fitted model includes the regression parameter estimates β̂ = (β̂_0, β̂_1, ..., β̂_k) and the associated covariance matrix σ̂_j² V_j, where V_j is the usual (X'X)^{-1} matrix derived from the intercept and covariates X_1, X_2, ..., X_k.

2. New parameters β_* = (β_{*0}, β_{*1}, ..., β_{*k}) and σ_{*j}² are drawn from the posterior predictive distribution of the parameters (Rubin 1987). That is, they are simulated from β̂, σ̂_j², and V_j. The variance is drawn as

   σ_{*j}² = σ̂_j² (n_j − k − 1) / g

where g is a χ²_{n_j−k−1} random variate and n_j is the number of nonmissing observations for Y_j. The regression coefficients are drawn as

   β_* = β̂ + σ_{*j} V'_{hj} Z

where V_{hj} is the upper triangular matrix in the Cholesky decomposition V_j = V'_{hj} V_{hj}, and Z is a vector of k + 1 independent random normal variates.

3. The missing values are then replaced by

   y_* = β_{*0} + β_{*1} x_1 + ... + β_{*k} x_k + z_i σ_{*j}

where x_1, x_2, ..., x_k are the values of the covariates and z_i is a simulated normal deviate.
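The three steps can be sketched in Python for the simple case of one covariate plus an intercept (a hedged illustration with made-up data and seed; PROC MI's actual implementation is in SAS, and the 2×2 Cholesky helper here is written out by hand for this special case):

```python
import math
import random

random.seed(42)

def chol_upper(a, b, c):
    """Upper-triangular U with U'U = [[a, b], [b, c]] (2x2 Cholesky)."""
    u11 = math.sqrt(a)
    u12 = b / u11
    u22 = math.sqrt(c - u12 * u12)
    return u11, u12, u22

# Complete cases: covariate x and observed outcome y.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8]
n, k = len(x), 1

# Step 1: fit the regression y = b0 + b1*x on the observed cases.
sx, sy = sum(x), sum(y)
sxx, sxy = sum(v * v for v in x), sum(a * b for a, b in zip(x, y))
det = n * sxx - sx * sx
b1 = (n * sxy - sx * sy) / det
b0 = (sy - b1 * sx) / n
rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
sigma2_hat = rss / (n - k - 1)
# V = (X'X)^{-1} for the design matrix with columns [1, x].
v11, v12, v22 = sxx / det, -sx / det, n / det

# Step 2: draw sigma*^2 and beta* from the posterior predictive distribution.
g = random.gammavariate((n - k - 1) / 2.0, 2.0)  # chi-square_{n-k-1} draw
sigma2_star = sigma2_hat * (n - k - 1) / g
sigma_star = math.sqrt(sigma2_star)
u11, u12, u22 = chol_upper(v11, v12, v22)        # V = U'U
z0, z1 = random.gauss(0, 1), random.gauss(0, 1)
# beta* = beta_hat + sigma* U' z
b0_star = b0 + sigma_star * (u11 * z0)
b1_star = b1 + sigma_star * (u12 * z0 + u22 * z1)

# Step 3: impute a missing y at x = 4.5, adding a fresh normal deviate.
y_star = b0_star + b1_star * 4.5 + sigma_star * random.gauss(0, 1)
print(round(b1, 3), round(y_star, 2))  # slope ~ 0.986; y_star varies with the draws
```

Drawing the parameters in step 2, rather than reusing the point estimates, is what makes the imputations differ across the m data sets.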
The predictive mean matching method can also be used for imputation. It is similar to the regression method except that for each missing value, it imputes an observed value that is selected from the specified number of nearest observations to the predicted value from the simulated regression model (Rubin 1987). The predictive mean matching method ensures that imputed values are plausible, and it might be more appropriate than the regression method if the normality assumption is violated (Horton and Lipsitz 2001).
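The donor-selection idea behind predictive mean matching can be sketched in a few lines of Python (hypothetical illustration; the donor-set size and data are made up): given a predicted value from the simulated regression model, the imputed value is drawn at random from the observed values closest to that prediction, so every imputation is a value that actually occurred.

```python
import random

random.seed(3)

def pmm_impute(predicted, observed, n_nearest=5):
    """Predictive mean matching: return an actually observed value drawn at
    random from the n_nearest observed values closest to the prediction."""
    donors = sorted(observed, key=lambda v: abs(v - predicted))[:n_nearest]
    return random.choice(donors)

observed = [3.1, 4.7, 5.0, 5.2, 6.8, 7.4, 9.9]
value = pmm_impute(5.4, observed, n_nearest=3)
print(value)  # one of the 3 closest observed values: 5.2, 5.0, or 4.7
```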

MCMC methods for data sets with arbitrary missing patterns
MCMC originated in physics as a tool for exploring equilibrium distributions of interacting molecules. In statistical applications, it is used to generate pseudorandom draws from multidimensional and otherwise intractable probability distributions via Markov chains. A Markov chain is a sequence of random variables in which the distribution of each element depends on the value of the previous one.
In MCMC, a Markov chain is constructed that is long enough for the distribution of its elements to stabilize to a common stationary distribution, which is the distribution of interest. Repeatedly simulating steps of the chain then yields draws from the distribution of interest (Schafer 1997).
In Bayesian inference, information about unknown parameters is expressed in the form of a posterior probability distribution. MCMC has been applied as a method for exploring posterior distributions in Bayesian inference. That is, through MCMC, the entire joint posterior distribution of the unknown quantities can be simulated and simulation-based estimates of posterior parameters can be obtained.
Assuming that the data are from a multivariate normal distribution, data augmentation is applied to Bayesian inference with missing data by repeating the following steps: 1. The imputation I-step: With the estimated mean vector and covariance matrix, the I-step simulates the missing values for each observation independently. That is, if the variables with missing values for observation i are denoted by Y i(mis) and the variables with observed values are denoted by Y i(obs) , then the I-step draws values for Y i(mis) from a conditional distribution Y i(mis) given Y i(obs) .
2. The posterior P-step: The P-step simulates the posterior population mean vector and covariance matrix from the complete sample estimates. These new estimates are then used in the I-step. Without prior information about the parameters, a noninformative prior is used. Informative priors can also be used; for example, prior information about the covariance matrix can help stabilize the inference about the mean vector for a near-singular covariance matrix.
That is, with a current parameter estimate θ^(t) at the t-th iteration, the I-step draws Y_mis^(t+1) from p(Y_mis | Y_obs, θ^(t)), and the P-step draws θ^(t+1) from p(θ | Y_obs, Y_mis^(t+1)). The two steps are iterated long enough for the results to reliably simulate an approximately independent draw of the missing values for a multiply imputed data set (Schafer 1997).
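The I-step/P-step alternation can be sketched in Python for two variables (a deliberately simplified illustration with made-up data: a full P-step draws the mean vector and covariance matrix from their joint posterior, whereas this sketch draws only the mean of the imputed variable, refits the slope, and holds the residual standard deviation fixed):

```python
import random
import statistics

random.seed(11)

# Toy data: y1 fully observed; y2 missing (None) for some cases.
y1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y2 = [1.2, 2.1, None, 3.9, None, 6.2, 7.1, None]
miss = [i for i, v in enumerate(y2) if v is None]

# Starting values from the complete pairs.
pairs = [(a, b) for a, b in zip(y1, y2) if b is not None]
mx = statistics.fmean(a for a, _ in pairs)
my = statistics.fmean(b for _, b in pairs)
sxx = sum((a - mx) ** 2 for a, _ in pairs)
sxy = sum((a - mx) * (b - my) for a, b in pairs)
beta = sxy / sxx          # slope of y2 on y1
res_sd = 0.5              # residual sd held fixed (a simplification)

y2_filled = list(y2)
for t in range(100):
    # I-step: draw each missing y2 from its conditional distribution given
    # y1 and the current parameters (a normal around the regression line).
    for i in miss:
        y2_filled[i] = my + beta * (y1[i] - mx) + random.gauss(0, res_sd)
    # P-step (simplified): draw the mean of y2 given the completed data and
    # refit the slope; a full P-step would also draw the covariance matrix.
    n = len(y2_filled)
    my = statistics.fmean(y2_filled) + random.gauss(0, res_sd / n ** 0.5)
    mx = statistics.fmean(y1)
    sxx = sum((a - mx) ** 2 for a in y1)
    sxy = sum((a - mx) * (b - my) for a, b in zip(y1, y2_filled))
    beta = sxy / sxx

imputed = [round(y2_filled[i], 2) for i in miss]
print(imputed)  # draws for the three missing values
```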

The MI procedure
PROC MI provides various methods to create multiply imputed data sets for incomplete multivariate data that can be analyzed using standard SAS procedures. Table 2 summarizes the available statements in PROC MI.
The imputation method of choice depends on the pattern of missingness in the data and the type of the imputed variable. For a data set with a monotone missing pattern, the MONOTONE statement can be used to specify applicable monotone imputation methods; otherwise, the MCMC statement can be used assuming multivariate normality.

Statement   Description
BY          Specifies groups in which separate sets of multiple imputations are performed
CLASS       Specifies the classification variables in the VAR statement
EM          Computes the maximum likelihood estimate (MLE) of data with missing values by the expectation-maximization (EM) algorithm, assuming a multivariate normal distribution
FREQ        Specifies the variable that represents the frequency of occurrence of the observation
MCMC        Specifies Markov chain Monte Carlo imputation methods
MONOTONE    Specifies imputation methods for a data set with a monotone missing pattern
TRANSFORM   Specifies the variables to be transformed in the imputation process
VAR         Specifies the variables to be analyzed

Table 2: Statements in PROC MI.

Option      Description
MU0=        Specifies variable means under the null hypothesis in the t test for location
NIMPUTE=    Specifies the number of imputations
OUT=        Specifies the output data set that contains the imputed values
SEED=       Specifies the seed for the pseudorandom number generator

Table 3: Key options in PROC MI.
The TRANSFORM statement specifies the variables to be transformed before the imputation process; the imputed values of these transformed variables are reverse-transformed to their original forms after the imputation. The Box-Cox, exponential, logarithmic, logit, and power transformations can be used for the variables. Table 3 lists key options available in the PROC MI statement. Often, as few as three to five imputations are adequate in multiple imputation (Rubin 1996). If the NIMPUTE= option is not specified, NIMPUTE=5 is used. The OUT= option specifies the output SAS data set, which includes an identification variable, _IMPUTATION_, to identify the imputation number.

MONOTONE statement
The MONOTONE statement specifies monotone methods to impute variables for a data set with a monotone missing pattern. A VAR statement must be specified, and the data set must have a monotone missing pattern with variables ordered as in the VAR list. The DETAILS option of the REG method displays the regression coefficients in the regression model that are estimated from the observed data and the regression coefficients that are used in each imputation. Similarly, the DETAILS option of the LOGISTIC method displays the corresponding coefficients for the logistic regression model.

MCMC statement
The MCMC statement uses a Markov chain Monte Carlo method to impute values for a data set with an arbitrary missing pattern, assuming a multivariate normal distribution for the data. Table 5 summarizes the key options available for the MCMC statement.
The key options for the imputation details are:

CHAIN=SINGLE | MULTIPLE: The CHAIN= option specifies whether a single chain is used for all imputations (CHAIN=SINGLE) or a separate chain is used for each imputation (CHAIN=MULTIPLE) (Schafer 1997).

NBITER=number: The NBITER= option specifies the number of burn-in iterations before the first imputation in each chain. The default is NBITER=200.

NITER=number: The NITER= option specifies the number of iterations between imputations in a single chain. The default is NITER=100.

Example 2: MCMC method for arbitrary missing pattern data
This example uses the MCMC method to impute missing values for variables in a data set with an arbitrary missing pattern. The Fitness data set contains measurements made on men involved in a physical fitness course at N.C. State University; certain values have been set to missing so that the resulting data set has an arbitrary missing pattern. Only the variables Oxygen (oxygen intake rate, ml per kg body weight per minute), RunTime (time to run 1.5 miles, in minutes), and RunPulse (heart rate while running) are used.

The following statements use the MCMC method to impute missing values for these variables and store the result in the data set OutFitness. The statements also create an iteration (trace) plot for the successive estimates of the mean of Oxygen and an autocorrelation function plot for Oxygen:

   ods graphics on;
   proc mi data=Fitness nimpute=4 seed=501213 mu0=50 10 180
           out=OutFitness;
      em;
      mcmc plots=(trace(mean(Oxygen)) acf(mean(Oxygen)));
      var Oxygen RunTime RunPulse;
   run;

The expectation-maximization (EM) algorithm is a technique that finds maximum likelihood estimates for parametric models with incomplete data (Little and Rubin 2002). By default, the procedure uses the statistics from the available cases in the data as the initial estimates for the EM algorithm, with the correlations set to zero. With the EM statement, the initial parameter estimates for the EM algorithm and the resulting maximum likelihood estimates are displayed. The EM algorithm can also be used to compute posterior modes, the parameter estimates with the highest observed-data posterior density; these posterior modes are used to begin the MCMC process. After the completion of the specified four imputations, the Variance Information table displays the between-imputation variance, within-imputation variance, and total variance for combining complete-data inferences.
With the TRACE (MEAN(OXYGEN)) option, the procedure displays a trace plot for the mean of Oxygen, as shown in Figure 1. The plot shows no apparent trends for the variable Oxygen.

With the ACF (MEAN(OXYGEN)) option, an autocorrelation plot for the mean of Oxygen is displayed, as shown in Figure 2. It shows no significant positive or negative autocorrelation.
The following statements list the first 10 observations of the output data set OutFitness:

   proc print data=OutFitness(obs=10);
      title 'First 10 Observations of the Imputed Data Set';
   run;

Combining inferences from imputed data sets
With m imputations, m different sets of the point and variance estimates for a parameter Q can be computed. Let Q̂_i and Û_i be the point and variance estimates from the i-th imputed data set, i = 1, 2, ..., m. Then the point estimate for Q from multiple imputations is the average of the m complete-data estimates:

   Q̄ = (1/m) Σ_{i=1}^{m} Q̂_i

Let Ū be the within-imputation variance, which is the average of the m complete-data variance estimates:

   Ū = (1/m) Σ_{i=1}^{m} Û_i

and let B be the between-imputation variance:

   B = (1/(m−1)) Σ_{i=1}^{m} (Q̂_i − Q̄)²

Then the variance estimate associated with Q̄ is the total variance:

   T = Ū + (1 + 1/m) B

The statistic (Q − Q̄) T^{−1/2} is approximately distributed as a t distribution with v_m degrees of freedom (Rubin 1987), where

   v_m = (m − 1) [1 + Ū / ((1 + 1/m) B)]²

When the complete-data degrees of freedom v_0 is small and there is only a modest proportion of missing data, the computed degrees of freedom v_m can be much larger than v_0, which is inappropriate. Barnard and Rubin (1999) recommend the use of an adjusted degrees of freedom that also depends on v_0.

Statement      Description
BY             Specifies groups in which separate analyses are performed
CLASS          Specifies classification variables in the MODELEFFECTS statement
MODELEFFECTS   Lists the effects in the data set to be analyzed
STDERR         Lists standard errors associated with effects
TEST           Tests linear hypotheses about the parameters

Table 6: Statements in PROC MIANALYZE.
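Rubin's combining rules described above can be implemented in a few lines (a Python sketch with made-up inputs; PROC MIANALYZE performs these computations in SAS):

```python
import statistics

def combine(estimates, variances):
    """Combine m complete-data results by Rubin's rules: return the MI point
    estimate, the total variance, and the degrees of freedom."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)                     # point estimate
    u_bar = statistics.fmean(variances)                     # within-imputation
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # between-imputation
    t = u_bar + (1 + 1 / m) * b                             # total variance
    v_m = (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2    # df (Rubin 1987)
    return q_bar, t, v_m

# Point and variance estimates from m = 5 imputed data sets (made up).
q_hat = [10.2, 9.8, 10.5, 10.1, 9.9]
u_hat = [0.40, 0.38, 0.41, 0.39, 0.42]
q_bar, t, v_m = combine(q_hat, u_hat)
print(round(q_bar, 2), round(t, 3))  # -> 10.1 0.49
```

Here the between-imputation variance B is small relative to Ū, so the total variance is only modestly inflated and the degrees of freedom are large.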

The MIANALYZE procedure
From m imputations, m different sets of the point and variance estimates for a parameter Q can be computed. PROC MIANALYZE combines these results and generates valid statistical inferences about the parameter. Multivariate inferences can also be derived from the m imputed data sets. Table 6 lists available statements in PROC MIANALYZE.
The MODELEFFECTS statement lists the effects in the data set to be analyzed. Each effect is a variable or a combination of variables, and is specified with a special notation using variable names and operators.
The STDERR statement lists standard errors associated with effects in the MODELEFFECTS statement, when the input DATA= data set contains both parameter estimates and standard errors as variables in the data set.
The TEST statement tests linear hypotheses about the parameters β. An F test is used to jointly test the null hypotheses (H_0: Lβ = c) specified in a single TEST statement.

The following statements use the MIANALYZE procedure to read parameter estimates in the PARMS= data set and the associated covariance matrices in the COVB= data set:

   proc mianalyze parms=lgparms covb=lgcovb;
      modeleffects Intercept Length;
   run;

The Model Information table lists the input data sets and the number of imputations. The Variance Information table displays the between-imputation, within-imputation, and total variances for combining complete-data inferences.

The next example creates an EST-type data set that contains regression coefficients and their corresponding covariance matrices computed from imputed data sets. These estimates are then combined to generate valid statistical inferences about the regression model.
The following statements use the REG procedure to compute regression coefficients for each imputed data set stored in OutFitness:

   proc reg data=OutFitness outest=regest covout noprint;
      model Oxygen= RunTime RunPulse;
      by _Imputation_;
   run;

The following statements display the output OUTEST= data set from PROC REG for the first two imputed data sets:

   proc print data=regest(obs=8);
      var _Imputation_ _Type_ _Name_ Intercept RunTime RunPulse;
      title 'REG Model Coefficients (First Two Imputations)';
   run;

The REG Model Coefficients (First Two Imputations) table displays regression coefficients and their covariance matrices for the first two imputed data sets. The following statements combine the results from the imputed data sets:

   proc mianalyze data=regest edf=28;
      modeleffects Intercept RunTime RunPulse;
   run;

The EDF= option is specified to request that the adjusted degrees of freedom be used in the analysis. For a regression model with three independent variables (including the Intercept) and 31 observations, the complete-data error degrees of freedom is 28.
The Model Information table lists the input data set and the number of imputations. The Variance Information table displays the between-imputation, within-imputation, and total variances for combining complete-data inferences.