REALCOM-IMPUTE Software for Multilevel Multiple Imputation with Mixed Response Types

Multiple imputation is becoming increasingly established as the leading practical approach to modelling partially observed data, under the assumption that the data are missing at random. However, many medical and social datasets are multilevel, and this structure should be reﬂected not only in the model of interest, but also in the imputation model. In particular, the imputation model should reﬂect the diﬀerences between level 1 variables and level 2 variables (which are constant across level 1 units). This led us to develop the REALCOM-IMPUTE software, which we describe in this article. This software performs multilevel multiple imputation, and handles ordinal and unordered categorical data appropriately. It is freely available on-line, and may be used either as a standalone package, or in conjunction with the multilevel software MLwiN or Stata .


Introduction
Multiple imputation is becoming increasingly established as the leading practical approach to analysing partially observed datasets (Sterne, White, Carlin, Spratt, Royston, Kenward, Wood, and Carpenter 2009;Klebanoff and Cole 2008).Although there are now an increasing number of software packages around, they vary in their accessibility to data analysts.More fundamentally, some software uses the full conditional specification approach.For an early example see van Buuren, Boshuizen, and Knook (1999), which does not explicitly model the joint distribution but forms univariate models for each incomplete variable in turn conditional on all the others.There is no guarantee in general that these correspond to a proper joint model.
Other software is based on an explicit joint model, as described for example in Schafer (1997).
Moreover, some software treats discrete data as continuous in the imputation model, and most packages do not allow for multilevel structure (Kenward and Carpenter 2007).
In this paper, we describe the REALCOM-IMPUTE software we have developed.This is an extension of the REALCOM software that we have developed to fit multivariate multilevel mixed response models (Goldstein, Carpenter, Kenward, and Levin 2009).Using a multivariate latent normal model allows us to properly handle discrete data, and also naturally allow for two-level structure.In particular, our software allows us to properly handle 'level 2' variables, which are constant over the observations at level 1.This work builds on that described in Carpenter and Goldstein (2005), where we described macros for multilevel multiple imputation in MLwiN (Centre for Multilevel Modelling.2011) which use a multilevel normal imputation model.In that paper, we also demonstrated the need for multiple imputation to respect the multilevel structure of the data, and the multilevel structure in the model of interest, in order to avoid biasing the parameter estimates in the multilevel model and producing potentially invalid estimates of precision.
As before, a further aim in our development has been to build on the unique Equations window in MLwiN to make the process of multiple imputation, together with both the imputation model and the model of interest, as accessible as possible.We believe this is the key for data analysts, who may not have sufficient statistical training to use multiple imputation appropriately.
The article is structured as follows.In Section 2 we describe our example multilevel data set, and in Section 3 we give more details about our REALCOM-IMPUTE software.Section 4 describes the use of the software, and we conclude with a discussion in Section 5.

Example data set
To illustrate our approach we will use data from the class size study, made available to us by Peter Blatchford at the Institute of Education, London.This study sought to understand the effect of class size on development of literacy and numeracy skills in the first two years of English childrens' full time education.The analysis below is illustrative; for a fuller analysis and more details of the study see Blatchford, Goldstein, Martin, and Browne (2002).
The version of the dataset we analyse below relates to children in their first year of full time education in the UK, known as the reception year.Table 1 shows the five variables in the dataset, which is available with this article.Four measure literacy and numeracy skills when children start their reception class and at the end of their reception year, and the fifth is class size.nlitpre, nmatpre, nlitpost and nmatpost.This was done as follows.For each test, the pupils' results were ranked.Then for observation in rank order i, where n pupils sat the test, the normalized result was calculated as the inverse normal of i/(n + 1).Since many pupils got the same marks there are ties in the data.After this transformation, Figure 1 shows the test score data data are approximately normal.
The class size variable originally could change within classes, since in England children enter their first year class either in September or after Christmas, depending on their birthday.For the analysis here we avoid this additional complication by, for each class, calculating the class size as the average average class size over all the children in the class at the end of the first year.Where this could not be calculated, because class size was missing for one or more children in the class, we set class size to missing.We note that class size is not just the count of the number of children with the same class identifier in the dataset.Among other reasons this arises because not all children in a class were necessarily included in the study.In the original paper reporting the results, the effect of class size was modelled using a spline (Blatchford et al. 2002).To simplify the analysis, and illustrate how our software can handle categorical data, we here use a four-category version of class size.
The dataset is thus multilevel, with children at level 1 belonging to classes at level 2. Class size is a level 2 variable.
Parameter estimates from fitting this, our model of interest, to the 5033 complete cases are shown in Table 3 and the top panel of Figure 5.As expected, we see that literacy and numeracy when children start school strongly predicts their score at the end of the year.Increasing class size seems to reduce achievement at the end of the year, and this just passes significance at the 5% level if the class size is 25 or over.

REALCOM-IMPUTE software
The REALCOM-IMPUTE software is a free standing package, designed to have a smooth interface with MLwiN, although it can be used with any package.It fits multivariate response models to 2-level data, allowing for both level 1 and level 2 variables, and through this allows proper imputation of missing data.Continuous data are modelled using the multivariate normal distribution.The default is to have all the variables as responses, although fully observed variables can be included as covariates; in this way interactions with fully observed variables may be handled.For each level 1 response a mean and level 2 random intercept is fitted, together with a level 1 residual.For level 2 variables, only a mean and level 2 residual is fitted.Level 1 and level 2 residuals are assumed independent, with mean zero, with separate covariance matrices.If all variables are normal these are unstructured; otherwise these have appropriate structures for the latent normal model for discrete data (Goldstein et al. 2009).
As an illustration, consider multiple imputation for model (1), fitted to the class size data.
In this example, there are three level-1 variables which (as they have been transformed) we treat as normal, and a level 2 variable, class size, which we treat as unordered categorical.
In addition, we include literacy score at the end of the first year, nlitpost, as an auxiliary variable.That is, it is included in the imputation model both to increase the plausibility of the underlying assumption that data are missing at random and because it is a strong predictor of the other variables.The details of how to use the software are described in the next subsection.We first describe the model that REALCOM-IMPUTE will propose in the absence of the level 2 variable csize.
We then describe how this is extended when we introduce class size as an unordered categorical variable at level 2. As above, let i index children and j index class.REALCOM-IMPUTE proposes the following model (subscripts of random effects and coefficients are chosen to match the software output): where Ω cts u is an unstructured 4 × 4 covariance matrix; where Ω e is an unstructured 4 × 4 covariance matrix. (2) We note that this imputation model is equivalent to a conditional model for each variable in which it is linearly regressed on all the other variables.
These latent variables are included in (2) at level 2 by adding where The overall covariance matrix of the level 2 random effects is then with the 4-by-3 submatrix Ω cont,cat u unstructured.For full details of this model see Goldstein et al. (2009).This extended model remains equivalent to a conditional model for each (latent) variable in which it is linearly regressed on all the other variables.
We can add covariates to this model, provided they have no missing data, and these covariates can have separate coefficients for each response, which may have random effects at level 2. Binary and discrete data can either be treated as normal-possibly after transformation (Bernaards, Belin, and Schafer 2007; Lee and Carlin 2010)-or, preferably in our view, properly modelled.Again, we use a latent normal structure to do this, via a probit link function, with the probit analogue of the proportional odds model for ordinal data.We reiterate that all these discrete variables can be included at either level 1 or two; the appropriate constraints on the covariance matrices are implemented automatically.
Once specified, the REALCOM-IMPUTE software fits the model using Markov Chain Monte Carlo.We use a Gibbs sampling approach, updating each set of parameters in turn, conditional on the others.Where possible, we sample direct from the appropriate conditional distribution.Otherwise we use Metropolis steps or rejection sampling.Full details are given in Goldstein et al. (2009).The user has the option of requesting plots of parameter chains, and these are displayed and updated as the MCMC sampler runs.After the user-specified burn in, the software displays the parameter estimates for the joint model.If requested, it will then continue updating the sampler, imputing the missing data, and creating a file of imputed datasets stacked vertically.
The REALCOM-IMPUTE software has recently been extended to allow for weights (e.g., survey weights) both at level 1 and at level 2 and this is the subject of a forthcoming paper.

Obtaining and using the REALCOM-IMPUTE software
The program MLwiN can be downloaded from http://www.cmm.bris.ac.uk/, and is freely distributed to the UK academic community.The REALCOM-IMPUTE software is free for all users and can be downloaded from http://www.cmm.bristol.ac.uk/research/Realcom/ index.shtml.Code to export data from Stata (StataCorp.2011) to REALCOM-IMPUTE and vice-versa is available from http://www.missingdata.org.uk/.We now describe the steps required to obtain the results presented in Section 4 below; more detailed instructions are available in the on-line manual that comes with the software.
Multiple imputation in the MLwiN ↔ REALCOM-IMPUTE framework proceeds as follows: 1. Fit the model of interest in MLwiN.By default, this will only use complete cases for the estimation.Doing this for the class size data gives the results shown in the top panel of Figure 5 (parameter estimates and standard errors shown in green).
2. In MLwiN, from the toolbar, select Model -> Imputation -> Save Imputation Specification.A dialogue will open, asking for the number of response variables, number of auxiliary variables (these are included as covariates in the imputation model, and should not have any missing data) and the level 2 identifier.Then the user must specify the names of the response and auxiliary variables (including the constant) from a dropdown menu.In addition, the user must specify the 'type' for each response (continuous, ordinal, or unordered categorical).Binary variables may be specified as either ordinal or unordered categorical.
The resulting dialogue for the class size analysis is shown in Figure 2. We first specify the number of response variables (all the variables in the model of interest which have missing observations, together with any partially observed auxiliary variables we wish to bring into the imputation), the number of fully observed variables (including any fully observed auxiliary variables) and the column identifying the level 2 groups.Here, these are classes, and the class identifier is clsnr.The four variables in the model of interest, nmatpost, nmatpre, nlitpre and csize are all responses, together with the (partially observed) auxiliary variable nlitpost.The only fully observed auxiliary variable in this setting is the intercept, CONS, i.e., a vector of 1's.A window appears.Clicking on open data file prompts for a file name for loading the data file created by MLwiN.Then, click on the show equations box, to display the proposed imputation model (right panel of Figure 3).In most cases, this imputation model will be the appropriate one (given the data exported from MLwiN).If not, the Model Specification part of the REALCOM-IMPUTE dialogue, allows the user to change the variable which identifies clusters, add/remove response variables, explanatory variables and random coefficients.In addition, we can constrain certain coefficients.
For our example, the right hand panel of Figure 3 shows the imputation equation.We see that the top four responses are modelled as normal, with level 2 and level 1 residuals.Class size is modelled as an unordered categorical.We do not model it as an ordinal variable, in order to allow for the fact it may not satisfy the proportionality assumption.Note the index c on π c,5j , which indexes category (c = 2, 3, 4 as category 1 is a reference) and enables a more concise presentation of the model.
Once the imputation model is specified as desired, details of the MCMC estimation can be specified using the Estimation part of the dialogue.In particular, clicking on MCMC estimation settings allows specification of the 'burn in' for the MCMC sampler, total number of iterations (post burn in) and the refresh rate.We now discuss these in more detail.
The 'burn in' should be long enough for the sampler to leave behind the initial values and converge to the posterior distribution (see Gilks, Richardson, and Spiegelhalter 1996, for example).This occurs with fewer updates if variables are mean-centred.While more complex models require a longer burn in, we believe 2000 is plenty and 500 may be sufficient for simpler models.
We wish successive imputed datasets to be (approximately) independent draws from the distribution of the missing data given the observed.In practice we believe that 500 updates in between imputations is sufficient to achieve this.If we adopted this, for 20 imputations we would need to specify 10,000 updates.
Lastly, the refresh rate is the frequency with which the graphical monitor of the MCMC process updates.This entails a (non-statistical) computational overhead.Too low a value slows the software considerably.We therefore recommend a value between 50 and 100.
To exit the MCMC estimation settings dialogue, click on Done.Next click on Monitor to specify which of the parameter chains are to be monitored.Again, monitoring all chains slows the software; we often focus on variance parameters and parameters relating to discrete variables.
Then, click on Impute to specify the iterations of the MCMC sample at which an imputed dataset will be generated; for example with 10,000 post-burn in iterations of (1) to the complete cases (using restricted maximum likelihood); bottom results of multiple imputation.See also columns 2, 3 of Table 3.
the MCMC sampler, 20 imputations could be created at iterations 500, 1000, 1500, . . ., 10,000 (see the start of Section 4 for a discussion on the number of imputations).Within the Impute dialogue, click on the icon to specify the file name for the imputed data.Figure 4 shows this dialogue for our imputation for the class size data.
To start REALCOM-IMPUTE , return to the left hand dialogue and click on Start MCMC Run 4. When REALCOM-IMPUTE has finished, return to MLwiN and, from the toolbar, select Model -> Imputation -> Retrieve Imputation.This prompts for the name of the file of imputed data created by REALCOM-IMPUTE.
5. The final step is to fit the model of interest to each imputed dataset.To do this, from the MLwiN toolbar select Model -> Imputation-> Start Analysis.MLwiN then fits the model of interest to each imputed data set, combines the results using Rubin's rules, and displays them in the Equations window (in blue).
The bottom panel of Figure 5 shows the results of multiple imputation, and is discussed in more detail in Section 4.

Additional REALCOM-IMPUTE commands
The above commands are sufficient to use REALCOM-IMPUTE with MLwiN for multiple imputation.However, there are a number of additional commands which can be called through the buttons shown in the left panel of Figure 3.We now briefly describe these; more detail and examples are given in the online manual available with the software.
The Level-2 identifier buttons allow us to change the variable which identifies level 2 units.The Responses buttons allow us to add/remove responses and specify whether they are continuous, ordinal or unordered categorical.The Explanatory variables buttons allow adding/removing explanatory variables and adding/removing random coefficients at level 2; we can also fix a β coefficient to be zero if desired.Lastly, the Results Save/Load display the parameter estimates for the imputation model, and allow the specified model to be saved and subsequently retrieved.

Application to class size data
Here we describe the results of multiple imputation using the REALCOM-IMPUTE software.We performed multiple imputation using the 7394 partially observed cases.
We have already noted that the results of fitting the model of interest (1) to the 5033 complete cases are shown in the Table 3.We followed the steps in Subsection 3.1, and took literacy score at the end of the first year as an auxiliary variable.Using a burn in of 2000 and 500 further updates between each of 20 imputations, multiple imputation gave the parameter estimates in the righthand column of Table 3.For a practical discussion of the number of imputations, as well as issues to consider in building imputation models, see Spratt, Sterne, Tilling, Carpenter, and Carlin (2010).
Relative to the complete case analysis, after multiple imputation we see similar estimated coefficients for pre-school numeracy and literacy (β 1 , β 2 ).However, the effect of class size is weaker after multiple imputation.Classes over 25 have a lower post reception maths score on average, but this does not quite reach significance at the 5% level (β 3 , β 4 ).

Discussion and conclusion
We have described standalone REALCOM-IMPUTE software for performing multiple imputation on 2-level data, and illustrated its use on data from a study of the effect of class size on achievement among children in their first year at school.
Assuming the data are missing at random given the variables in the model, we find the effect of class size on post-reception maths scores, adjusted for pre-reception literacy and numeracy, is slightly weaker than the complete cases analysis suggests: after multiple imputation classes over 25 have a lower post reception maths score on average, but this does not quite reach significance at the 5% level.
As described above, our software properly models partially observed data at level 1 and level 2, and can handle binary, discrete and ordinal data at both levels.Further details of the underlying model and estimation, together with a confirmatory simulation study, are given in Goldstein et al. (2009).
Our approach relies on joint modelling rather than full conditional specification.We have adopted this approach for several reasons.First, we believe that explicitly modelling the multilevel structure is a natural way to handle missing observations at level 2, congenial with the model of interest we wish to fit to the data.Indeed, the practicality of a full conditional approach in a non-trivial multilevel setting is unclear.Within this framework, the latent normal model provides a natural extension to binary, ordinal and categorical data; this is also implemented in REALCOM-IMPUTE.The two-level joint model we have implemented could thus be naturally extended-for example to allow for weighting at each level, to allow for more levels in the hierarchy, and to allow for cross classified data.Second, we believe the joint model is more robust to potentially redundant variables in the imputation model.This problem is most acute with binary and categorical variables, and the chance of it occurring increases with the size of the data set.In other words, we believe this approach is more likely to scale up to routine use by data analysts on large datasets.
Multiple imputation involves the specification of a joint model for the data, either implicitly or explicitly.Our software uses the idea behind the unique Equations window in MLwiN to accessibly display this information.We believe our imputation model is extremely flexible, and one of few approaches that allows for complex multilevel variance structure and missing data in level 2 variables.However, a flexible imputation model alone is not sufficient; for multiple imputation point and interval estimates to be (first-order) unbiased the joint distribution of the full set of study variables needs to be congenial with the conditional model of interest (Meng 1994).In particular, as discussed in Section 3 conditional relationships among responses in our imputation model are linear and thus will be uncongenial with models of interest which include non-linear relationships.However, if the dependent variable in a non-linear relationship is (almost) fully observed, it can be included as a covariate in the imputation model.Another possible source of uncongeniality is that the model of interest is a logistic regression, whereas the REALCOM-IMPUTE model uses probit regression.However, as the probit and logit links are close when probabilities are away from 0 and 1, we believe this is unlikely to be an issue in practice.Multiple imputation puts the onus on the analyst to devise an appropriate imputation model-it is therefore thought-intensive as well as computer-intensive.
The software handles generalized linear models (including negative binomial) as the model of interest, as well as multivariate response models of interest.Currently, the speed of the matlab code is a limitation, but we are working to address this.We are also working on an extension to handle simple non-linear relationships and interactions.
In conclusion, we have described software for multiple multilevel imputation.Previously, we have shown that respecting the multilevel structure in the imputation is important to avoid biasing parameter estimates (Carpenter and Goldstein 2005).This is particularly the case if data are unbalanced, a typical feature of educational data.Our software is standalone, but designed to interface easily with MLwiN.Code for interfacing with R (R Development Core Team 2011) is in preparation; an interface from Stata is available from http://www.missingdata.org.uk/.We have demonstrated the use of our software on data from a study on the effect of class size on children in their first year at school, showing that it can naturally handle missing data at level 2. We encourage readers to download the software and explore it themselves.

Figure 1 :
Figure 1: Histograms of literacy and numeracy variables.

Figure 2 :
Figure 2: MLwiN dialogue box for exporting data to the REALCOM-IMPUTE software.

Figure 4 :
Figure 4: REALCOM-IMPUTE software: Specifying the iterations at which imputations are generated.

Figure 5 :
Figure 5: Screen shots of MLwiN Equations window.Top: Fitting the model of interest (1) to the complete cases (using restricted maximum likelihood); bottom results of multiple imputation.See also columns 2, 3 of Table3.
The respective literacy and numeracy test scores have been normalized to create the variables csize Categorical class size variable: 1 is ≤ 19; 2 is 20-24; 3 is 25-29; 4 is ≥ 30 Table 1: Description of variables in class size data used in this analysis.

Table 3 :
Parameter estimates from fitting model (1) to complete cases, and after multiple imputation.The model of interest is fitted using restricted maximum likelihood.