Multiple Imputation Through XGBoost

Abstract The use of multiple imputation (MI) is becoming increasingly popular for addressing missing data. Although some conventional MI approaches have been well studied and have shown empirical validity, they have limitations when processing large datasets with complex data structures. Their imputation performance usually relies on proper specification of the imputation models, which requires expert knowledge of the inherent relations among variables. Moreover, these standard approaches tend to be computationally inefficient for medium and large datasets. In this article, we propose a scalable MI framework, mixgb, which is based on XGBoost, subsampling, and predictive mean matching. Our approach leverages the power of XGBoost, a fast implementation of gradient boosted trees, to automatically capture interactions and nonlinear relations while achieving high computational efficiency. In addition, we incorporate subsampling and predictive mean matching to reduce bias and to better reflect imputation variability. The proposed framework is implemented in the R package mixgb. Supplementary materials for this article are available online.


Introduction
Multiple imputation (MI), first introduced by Rubin (1978), has received increasing recognition as a way of dealing with missing data because it can reduce bias and represent the uncertainty of missing values. Given an incomplete dataset, each missing value is replaced by a set of M > 1 plausible values rather than a single value. Analyses can then be performed separately on the M completed datasets, and the results can be combined to yield valid inference. Rubin (1987) argued that with proper multiple imputation, the pooled estimates and variances are statistically valid. A number of frameworks have been developed to implement MI. However, few automated procedures for large-scale imputation have been devised.
A major flaw of traditional MI implementations is that they fail to automatically capture complex relations among variables. If such relations exist, popular software packages such as mi (Su et al. 2011) and mice with default settings (van Buuren and Groothuis-Oudshoorn 2011) will produce unsatisfactory results unless users manually specify any potential nonlinear effects in the imputation model for each incomplete variable. Indeed, van Buuren and Groothuis-Oudshoorn (2011) indicated that it is crucial to include all interactions of interest in the imputation model in order to achieve optimal results. However, it appears that researchers often use mice in an automated manner (e.g., Wendt et al. 2021; Awada et al. 2021). Tree-based algorithms offer a potential solution to this problem. Stekhoven and Bühlmann (2012) proposed a non-parametric method for imputing missing values based on random forests (Breiman 2001) and implemented it in an R package called missForest. Doove et al. (2014) implemented classification and regression trees (CART) (Breiman et al. 1984) and random forests within the mice framework (mice-cart and mice-rf) and showed that they preserve nonlinear effects better than the standard implementation of mice.
Another drawback of existing MI frameworks is the excessive computation time for large datasets. Recent advances in technology have made the collection and analysis of large datasets feasible. However, current MI methods, including those that can capture complex data structures such as mice-cart and mice-rf, are more suited for small datasets and often struggle with medium and large datasets. This problem can become unmanageable when there is a moderate number of missing values present across many variables. Recently, mice has included ranger (Wright and Ziegler 2017), a fast implementation of random forests, as one of its imputation methods. However, its imputation performance and computational time have not been investigated.
The purpose of this paper is to present a fast and automated MI procedure mixgb, which is based on XGBoost (Chen and Guestrin 2016), subsampling, and predictive mean matching (Little 1988), with a focus on yielding statistically valid results. XGBoost, a fast tree boosting algorithm, has been a frequent winner in Kaggle data competitions (Chen and Guestrin 2016) and has gained immense popularity due to its speed and accuracy. XGBoost's ability to efficiently capture complex structures of large datasets means it has great potential for automated MI. With subsampling and predictive mean matching, our proposed method can better incorporate the variability of missing data and enhance imputation quality. This paper is structured as follows. Section 2 describes the proposed MI through XGBoost framework (mixgb) in detail. Section 3.1 provides an overview of the simulation study that we used to evaluate imputation performance. The evaluation criteria for MI implementations are presented in Section 3.2. Simulation results are given in Section 3.3, demonstrating the imputation quality of mixgb. Section 4 demonstrates the advantage of mixgb over other implementations in terms of computational efficiency, and Section 5 provides an example using a real dataset. Finally, discussions based on empirical results are presented in Section 6. We have implemented the proposed method in our R package mixgb (Deng 2023), which is available at https://github.com/agnesdeng/mixgb and on CRAN.
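To give a sense of the interface, the following is a minimal usage sketch of the package; argument names follow the mixgb documentation at the time of writing and may differ across versions.

```r
# Minimal sketch: multiply impute an incomplete data frame with mixgb.
# nhanes3_newborn is an example dataset shipped with the package; any
# data frame containing NAs can be used instead.
library(mixgb)

imputed <- mixgb(data = nhanes3_newborn, m = 5)  # a list of 5 completed datasets
head(imputed[[1]])
```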

Framework
Using XGBoost to impute missing data has attracted growing interest in recent years.
However, previous work has focused on the prediction accuracy of missing values without adequately accounting for the uncertainty of missing data, resulting in an underestimation of imputation variability (e.g. Zhang et al. 2019). To address this problem, we propose using XGBoost with subsampling and predictive mean matching (PMM) for MI.
Our R package mixgb offers settings that allow users to use subsampling with different ratios and choose the type of PMM. In this paper, we use mixgb-sub to denote using XGBoost with PMM and subsampling, and mixgb for using XGBoost with only PMM.
Unlike the R package mice, our imputation framework is non-iterative. However, users can set the number of iterations in our package mixgb to be greater than one. In Section 3, the imputation performance of both mixgb and mixgb-sub was examined using a single iteration, whereas mice was evaluated with five iterations.
Our framework mixgb-sub works as follows. Given an incomplete $n \times p$ dataset $Y^{raw}$, we first sort the variables by the number of missing values in ascending order. We then conduct an initial imputation of the missing values to obtain a complete sorted dataset $Y$.
We let $Y_i^{obs}$ and $Y_i^{mis}$ denote the observed values and the imputed (originally missing) values of the variable $Y_i$. We also let $Y_{-i}^{obs}$ denote the corresponding data in all variables other than $Y_i$ for entries where $Y_i$ was observed. Similarly, $Y_{-i}^{mis}$ denotes the data in all variables other than $Y_i$ for entries where $Y_i$ was originally missing before the initial imputation.
We use XGBoost models with subsampling to better account for the uncertainty of missing values. This is analogous to sampling the parameters of a parametric model from their posterior distributions. For each of the M imputations, we generate a subsample $Y^*$ from $Y$. For each variable $Y_i$, we fit an XGBoost model to the subsample and use it to obtain predictions for $Y_i^{mis}$, which we denote by $\tilde{Y}_i^{*mis}$.
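To make the procedure concrete, below is a simplified sketch of one such imputation for a single continuous variable, written against the xgboost R package directly. It is an illustration of the idea rather than the mixgb internals: the helper name impute_once and the Type-2-style matching step are ours, and the real algorithm first fills all variables with initial values so that the predictor matrix is complete.

```r
# Sketch: one subsample-based imputation of a numeric variable y_name,
# followed by a simplified PMM step (see next section).
library(xgboost)

impute_once <- function(data, y_name, subsample_ratio = 0.7, pmm_k = 5) {
  obs   <- !is.na(data[[y_name]])
  X     <- data.matrix(data[, setdiff(names(data), y_name), drop = FALSE])
  y_obs <- data[[y_name]][obs]

  # Subsample the observed rows: an analogue of drawing model parameters
  # from their posterior distribution.
  idx    <- sample(which(obs), size = floor(subsample_ratio * sum(obs)))
  dtrain <- xgb.DMatrix(X[idx, , drop = FALSE], label = data[[y_name]][idx])
  fit    <- xgb.train(params = list(objective = "reg:squarederror"),
                      data = dtrain, nrounds = 100)

  pred_mis <- predict(fit, X[!obs, , drop = FALSE])  # predictions for Y_i^mis
  pred_obs <- predict(fit, X[obs, , drop = FALSE])

  # Simplified PMM: for each predicted missing value, draw a donor from
  # the pmm_k observed values whose predictions are closest.
  data[[y_name]][!obs] <- vapply(pred_mis, function(p) {
    nearest <- order(abs(pred_obs - p))[seq_len(pmm_k)]
    y_obs[sample(nearest, 1)]
  }, numeric(1))
  data
}
```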
However, for continuous data, simply imputing missing values with the point predictions $\tilde{Y}_i^{*mis}$ is prone to underestimating imputation variability. This problem can be mitigated by PMM, which was first proposed by Rubin (1986) and extended to MI by Little (1988). The detailed mixgb-sub algorithm is described in Algorithm 1 and illustrated in Figure 1.
Without subsampling, mixgb works in a similar way. However, the entire dataset is used for each imputation. Note that omitting subsampling will reduce the variability between imputations and the primary source of between-imputation variance will be due to the differences between the PMM donors. If the number of PMM donors is set to one and subsampling is not used, then each of the M imputations will be identical. Figure 2 illustrates the need for using PMM for imputing continuous data with mixgb.
We created a dataset of 1000 observations using $y_i = 5x_i + \epsilon_i$, where $x_i \sim N(0, 1)$ and $\epsilon_i \sim N(0, 1)$ for $i = 1, 2, \ldots, 1000$. We then generated 50% missing data in the explanatory variable $x$ under the missing completely at random (MCAR) mechanism. We refer to simulated observations chosen to be missing as "masked true". As shown in Figure 2, the variability within an imputed dataset produced by mixgb without PMM was considerably less than the variance of the observed data and the variance of the masked true data. In contrast, both Type 1 and Type 2 PMM yielded within-imputation variances similar to that of the masked true data. Users can choose the type of PMM (by default, Type 2 for continuous data and no PMM for categorical data).
Algorithm 1 (outline). Initialization: sort variables by the number of missing values in ascending order; make an initial guess for the missing data to obtain a complete sorted dataset $Y^{init}$. For each of the M imputations: draw a subsample $Y^*$; then, for $i = 1$ to $p$, fit an XGBoost model to $Y^*$ to obtain $\hat{\beta}^*$ and impute $Y_i^{mis}$ (with PMM for continuous variables).

Figure 2: Scatter plots of y versus x for observed data, masked true data, imputed data using mixgb without PMM, imputed data using mixgb with Type 1 PMM, and imputed data using mixgb with Type 2 PMM. It demonstrates that without the use of PMM, the within-imputation variance of mixgb is considerably lower than the variance of both the observed and masked true data.
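The demonstration in Figure 2 can be reproduced along the following lines. In this sketch, pmm.type is the mixgb argument selecting the PMM variant; as we understand the package interface, NULL disables PMM, but the exact argument values should be checked against ?mixgb.

```r
# Sketch: y = 5x + e with 50% MCAR missingness in x, imputed with and
# without PMM, comparing the spread of the imputed values.
set.seed(2022)
n <- 1000
x <- rnorm(n)
y <- 5 * x + rnorm(n)
dat <- data.frame(x = x, y = y)
masked <- sample(n, n / 2)   # 50% MCAR in x
dat$x[masked] <- NA

library(mixgb)
imp_none <- mixgb(dat, m = 1, pmm.type = NULL)  # no PMM (assumed interface)
imp_pmm2 <- mixgb(dat, m = 1, pmm.type = 2)     # Type 2 PMM

c(masked_true = var(x[masked]),
  no_pmm      = var(imp_none[[1]]$x[masked]),
  type2_pmm   = var(imp_pmm2[[1]]$x[masked]))
```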

Simulation studies

Overview
We generated a single complete dataset using data generation model (3.1). For each simulation run, we generated missing data via a missing at random mechanism (MAR) and obtained M = 5 imputed datasets through different MI methods.
Data generation. A dataset with mixed-type variables was generated using data generation model (3.1) for i = 1, 2, ..., 10000, where the error terms $\epsilon_i$ were drawn from a standard normal distribution. The variables norm1, ..., norm8 were drawn from standard normal distributions, bin1 was drawn from a Bernoulli distribution with p = 0.5, and ord1 was drawn from a binomial distribution with n = 2 and p = 0.5. The variables norm1, norm2, norm3 and norm4 had pairwise correlations of 0.5. The binary variable bin1 was correlated with both norm5 and norm6 with correlation ρ ≈ 0.55, while norm5 and norm6 had a correlation of 0.7. The ordinal variable ord1 was correlated with both norm7 and norm8 with correlation ρ ≈ 0.65, while norm7 and norm8 had a correlation of 0.7. Note that norm4, norm6 and norm8 were included in the dataset as ancillary variables but were not used to generate Y.
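The exact generation code is in the supplementary materials; the following sketch shows one standard way to induce correlations of this kind via a latent-normal construction. Thresholding at zero a latent normal that has correlation 0.7 with norm5 and norm6 gives a binary variable with p = 0.5 and correlation 0.7·√(2/π) ≈ 0.56 with each of them, close to the ρ ≈ 0.55 stated above; ord1 can be built analogously by cutting a latent normal at its 25% and 75% quantiles.

```r
# Sketch: correlated covariates via multivariate normals.
library(MASS)
n <- 10000

Sigma1 <- matrix(0.5, 4, 4); diag(Sigma1) <- 1
norms  <- mvrnorm(n, mu = rep(0, 4), Sigma = Sigma1)  # norm1..norm4

Sigma2 <- matrix(0.7, 3, 3); diag(Sigma2) <- 1
z      <- mvrnorm(n, mu = rep(0, 3), Sigma = Sigma2)
norm5  <- z[, 1]
norm6  <- z[, 2]
bin1   <- as.integer(z[, 3] > 0)  # Bernoulli(0.5); cor(bin1, norm5) ≈ 0.56
```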
Missing data mechanism. All variables, except for Y and the three ancillary variables norm4, norm6 and norm8, were made missing via an MAR mechanism depending on Y and one of the three ancillary variables.
Missing values in norm1, norm2 and norm3 were generated via a mechanism based on Y + norm4. Missing values in norm5 and bin1 were generated using the same mechanism, except with Y + norm6 in place of Y + norm4. Missing values in norm7 and ord1 were similarly generated based on Y + norm8.
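The exact missingness probabilities are given in the supplementary materials; a logistic form such as the following is one common way to impose this kind of MAR mechanism and is shown purely as an illustration.

```r
# Sketch: MAR missingness in norm2 driven by Y + norm4 (illustrative
# logistic form; not necessarily the probabilities used in the paper).
mar_mask <- function(driver) {
  p <- plogis(scale(driver)[, 1])   # missingness probability rises with driver
  runif(length(driver)) < p
}
dat$norm2[mar_mask(dat$Y + dat$norm4)] <- NA
```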
Methods. We assessed the imputation quality of the following implementations: default settings within mice (mice-default); classification and regression trees within mice (mice-cart); the fast implementation of random forests, ranger, within mice (mice-ranger); XGBoost with PMM (mixgb); and XGBoost with PMM and subsampling (mixgb-sub).
We used the default settings for methods from the R package mice, including the number of iterations maxit = 5, whereas the default number of iterations in mixgb and mixgb-sub is 1. We believe that our method has good performance even when the number of iterations is smaller than that of mice. The performance of all methods was evaluated over 1000 simulation runs. Our evaluation was based on empirical bias, within-imputation variance, between-imputation variance and coverage.
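The calls below sketch how these five methods can be invoked. The mice calls are standard; for mixgb-sub we assume subsampling is controlled through XGBoost's subsample hyperparameter passed via xgb.params, which should be checked against the package documentation.

```r
library(mice)
imp_default <- mice(dat, m = 5, maxit = 5, printFlag = FALSE)                   # mice-default
imp_cart    <- mice(dat, m = 5, maxit = 5, method = "cart", printFlag = FALSE)  # mice-cart
imp_rf      <- mice(dat, m = 5, maxit = 5, method = "rf", printFlag = FALSE)    # random forests
# (recent mice versions can use the ranger backend for "rf"; see ?mice.impute.rf)

library(mixgb)
imp_mixgb <- mixgb(dat, m = 5)                                      # mixgb: PMM only
imp_sub   <- mixgb(dat, m = 5, xgb.params = list(subsample = 0.7))  # mixgb-sub (assumed interface)
```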

Evaluation criteria
Suppose that β is the true value of an estimand, such as a true coefficient of the data generation model, and let $\hat{\beta}_m$ denote its estimate from the m-th imputed dataset, with estimated variance $W_m$, for $m = 1, \ldots, M$. Following Rubin (1987), the pooled estimate is $\bar{\beta} = \frac{1}{M}\sum_{m=1}^{M}\hat{\beta}_m$, the within-imputation variance is $\overline{W} = \frac{1}{M}\sum_{m=1}^{M}W_m$, and the between-imputation variance is $B = \frac{1}{M-1}\sum_{m=1}^{M}(\hat{\beta}_m - \bar{\beta})^2$. The total variance is $T = \overline{W} + (1 + 1/M)B$, and a 95% confidence interval for β is $\bar{\beta} \pm t_{\nu,0.975}\sqrt{T}$, where $t_{\nu,0.975}$ is the 97.5% quantile of a Student-t distribution with degrees of freedom ν as defined in Rubin (1987). We report the empirical bias of $\bar{\beta}$, the within- and between-imputation variances, and the coverage of these intervals.
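As a worked example, the pooling rules above can be computed directly from the M per-imputation estimates and standard errors (mice::pool() does this automatically; the helper pool_rubin below is ours).

```r
# Sketch: Rubin's rules for a single coefficient.
pool_rubin <- function(est, se) {
  M    <- length(est)
  qbar <- mean(est)                   # pooled estimate
  W    <- mean(se^2)                  # within-imputation variance
  B    <- var(est)                    # between-imputation variance
  Tvar <- W + (1 + 1 / M) * B         # total variance
  nu   <- (M - 1) * (1 + W / ((1 + 1 / M) * B))^2  # Rubin (1987) df
  c(estimate = qbar,
    lower = qbar - qt(0.975, nu) * sqrt(Tvar),
    upper = qbar + qt(0.975, nu) * sqrt(Tvar))
}

pool_rubin(est = c(1.02, 0.97, 1.05, 0.99, 1.01),
           se  = c(0.11, 0.10, 0.12, 0.11, 0.10))
```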

Results
Simulation results are summarized in Table 2. We visualized the simulation results of biases and imputation variances in Figure 3 and Figure 4, respectively. The average computational time for each simulation run was 10 seconds for mice-default, 67 seconds for mice-cart, 68 seconds for mice-ranger, 11 seconds for mixgb, and 12 seconds for mixgb-sub.
To evaluate the biases of the imputation methods, we compared the coefficient estimates obtained from the different MI implementations against the empirical true estimates and the estimates obtained from complete cases. These estimates are displayed in Figure 3. As can be seen, complete case estimates had the largest biases for most coefficients. Among the MI methods, mice-default performed worse than the tree-based methods, especially for interaction terms, which is consistent with the results obtained by Doove et al. (2014). The tree-based methods had rather similar performances: out of 14 coefficients, mixgb-sub obtained the least biased estimate for 8, while mixgb and mice-cart each did so for 3. In this study, mixgb-sub had slightly smaller bias than mixgb for most coefficients, but the differences between the two methods were small. Overall, Figure 3 shows that both mixgb and mixgb-sub obtained less biased estimates than mice-default and complete case analysis, and estimates as good as or better than those of mice-cart and mice-ranger.
We evaluated the imputation variances, as discussed in Section 3.2, by plotting the relative ratios $\mathrm{Var}_W/\mathrm{Var}_{target}$ in Figure 4. mice-default had larger within-imputation variances than the other methods, followed by mice-ranger, while mice-cart tended to have the smallest within-imputation variances.
In terms of between-imputation variance, mixgb without subsampling fell below the target variance for all coefficients. This confirms our expectation that when subsampling is not used, mixgb will underestimate the between-imputation variance because it does not incorporate the uncertainty of the model parameters. With subsampling, mixgb-sub was closer to the target variance and had the best performance overall. On the other hand, mice-default, mice-cart, and mice-ranger obtained higher between-imputation variances than the target values. Among these, mice-ranger had the largest relative ratio for 11 of the 14 estimates, indicating that it might have overestimated the between-imputation variances.
As seen in Table 2, all methods had zero coverage for terms norm5, norm7, ord12, norm5:bin11, and norm7:ord11. This was not surprising given our simulation settings.
The coverage rates for the coefficients norm1, norm2 and norm3 were higher than those of other coefficients for all methods. Overall, mice-default had the lowest coverage rates for all interaction terms. Tree-based methods had similar coverage rates for most coefficients, but mice-cart and mice-ranger had considerably lower coverage for bin11. The coverage rates of mixgb-sub were slightly better than those of mixgb.

Computational efficiency

In this section, we compare the computational time of different MI implementations using real and simulated datasets. Both XGBoost and ranger provide multithreaded parallel computing; we used all 20 available threads for mixgb-cpu, mixgbsub-cpu and mice-ranger.
For mixgb-gpu and mixgbsub-gpu, only a single CPU thread was used with GPU support.
Since XGBoost with GPU uses the tree method gpu_hist, we used the tree method hist for the CPU version in order to eliminate alternative explanations for the difference in computational time between them. The default number of iterations maxit is 5 in the mice package, whereas in our package mixgb the default maxit is 1. In this comparison, we set maxit to 1 for all methods. In practice, mice uses a higher value of maxit and will therefore be slower.
Apart from these, we used default settings for all methods. All code used in this section is available in the supplementary materials.
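The configurations compared here correspond to XGBoost settings along the following lines; tree_method and nthread are XGBoost parameters, and we assume they are forwarded via mixgb's xgb.params argument (check the package documentation for the exact mechanism).

```r
# Sketch: CPU (20 threads, hist) versus GPU (1 CPU thread, gpu_hist).
library(mixgb)
imp_cpu <- mixgb(dat, m = 5,
                 xgb.params = list(tree_method = "hist", nthread = 20))
imp_gpu <- mixgb(dat, m = 5,
                 xgb.params = list(tree_method = "gpu_hist", nthread = 1))
```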
First, we chose three real-world datasets of varying sizes (Credit, Allstate, and Higgs1M) to gain a sense of the imputation time for different MI implementations. For each dataset, 30% missing data was created under the MCAR mechanism in two variables. Higgs1M consists of the first 1 million samples of the HIGGS dataset (28 continuous features and 1 categorical feature); the original dataset has 11 million samples and can be obtained from the UCI repository (Dua and Graff 2019).
We can see that for the datasets Credit and Higgs1M, which have mainly continuous features, mice-default was quite fast. However, it was much slower at imputing Allstate, which has more categorical features. On the other hand, mice-cart was the slowest for all datasets. Compared to mice-cart, mice-ranger was much faster, yet it was slower than all versions of mixgb. With GPU support, mixgb-gpu and mixgbsub-gpu were faster than their CPU counterparts mixgb-cpu and mixgbsub-cpu for imputing the large datasets Allstate and Higgs1M, but the CPU versions were faster for imputing the smaller dataset Credit. Subsampling with mixgb took more time when only the CPU was used; however, the additional time required for subsampling was much smaller when the GPU was utilized.
To further investigate the computational time of different MI implementations, we ran an experiment on simulated data. We generated datasets with sample sizes ranging from 10^3 to 10^6 and with 11, 31, and 51 features. For continuous datasets, all variables were drawn from standard normal distributions. For categorical datasets, all explanatory variables had three levels and the response variable was binary. For each dataset, 50% missing data in three variables were created via the MCAR mechanism. We did not evaluate mice-cart here as it would have required excessive computation time even for medium-sized datasets.
The log average time to obtain five imputations for continuous data and categorical data is shown in Figures 5 and 6, respectively, with all results averaged over ten repetitions.
Overall, mice-default was the fastest for continuous datasets, but it performed poorly with increasing numbers of observations and features for categorical data. On the other hand, mice-ranger was reasonably fast for datasets with relatively small sample sizes; however, it scaled badly with sample size for both continuous and categorical data. To a certain extent, mice-ranger handled larger numbers of features more effectively than larger numbers of samples. Surprisingly, when working with the 10^6 × 11 dataset, mice-ranger took significantly longer to run than the other methods; we discuss this outlier below. All four versions of mixgb had good scalability across varying numbers of features and samples and outperformed mice-default and mice-ranger for large categorical datasets, where the GPU versions of mixgb were the fastest methods. When using the multithreaded CPU, mixgb-cpu and mixgbsub-cpu were faster than their GPU counterparts for smaller datasets, but mixgb with GPU support performed better for large sample sizes. Our experiment also showed that using subsampling with the CPU increased imputation time, especially when the sample size was large. However, the additional time required for subsampling was minor with GPU support.

The computational time for mice-ranger on the dataset with 10^6 observations and 11 features may seem counter-intuitive: it was the only case where the runtime was larger with fewer features. We profiled the mice-ranger imputation process for the three datasets with 10^6 observations. We found that the time taken to fit the imputation model using ranger increased with the number of features, as expected, and the resulting trees for 11 features were simpler and shallower. However, the number of observations at each terminal node was much larger. When mice-ranger obtains predicted values from a given tree, it randomly samples one observation from the corresponding terminal node. For the dataset with 11 features, each tree had on average approximately 300 times as many observations at each terminal node as the datasets with 31 and 51 features, which had similar numbers of observations at each terminal node. Since all datasets had 5 × 10^5 missing data points in each incomplete variable, the increase in time required to sample from terminal nodes was significant and outweighed the shorter time needed to build the trees. This effect caused the unusual runtime.

Data Examples
The NWTS (US National Wilms' Tumor Study) dataset (D'Angio et al. 1989) contains outcome variables (e.g. relaps and dead) as well as histology results for 3915 patients.
Since it provides both the central lab histology (histol) and local institution histology (instit) results, it has been extensively used as an example to illustrate two-phase sampling designs (e.g., Breslow and Chatterjee 1999;Breslow et al. 2009;Chen and Lumley 2020).
Other variables include study (3 or 4), stage (I-IV), the age and year at diagnosis, the weight of the tumor (specwgt), the diameter of the tumor (tumdiam), time to relapse (trel) and time to death (tsur). There are no missing values in the original dataset. We created missing data in histol, tumdiam and stage via an MAR mechanism dependent on relaps, instit, and specwgt: missing data in histol and stage depended on relaps and instit, and missing data in tumdiam depended on relaps and specwgt. Full details can be found in the supplementary materials. Five imputed datasets were then generated using mice-default, mice-cart, mice-ranger, mixgb and mixgb-sub. We fitted the following analysis model based on Kulich and Lin (2004):

$$h(t) = h_0(t) \exp(\beta_1 \text{histol} + \beta_2 \text{age}_1 + \beta_3 \text{age}_2 + \beta_4 \text{stage} + \beta_5 \text{tumdiam} + \beta_6 \text{histol} \times \text{age}_1 + \beta_7 \text{histol} \times \text{age}_2 + \beta_8 \text{stage} \times \text{tumdiam}),$$

where $\beta_2$ and $\beta_3$ are separate slopes for age younger than one year (age$_1$) and age one year or older (age$_2$). The variable stage is a binary indicator, coded 1 for stage III-IV and 0 for stage I-II.

Figure 7 compares the marginal distribution of the observed data, the masked true data and multiply-imputed data in the variable tumdiam when mean imputation is used. In this case, there is no variability among the five imputations. If we instead impute missing values by randomly sampling from the observed values of a variable, the marginal distribution of that variable will resemble that of the observed values, as shown in the top panel of Figure 8. The bottom panel of Figure 8, however, illustrates that random sampling imputation fails to capture the relationship between two variables. On the other hand, Figure 9 shows that using MI methods (mice-default, mice-cart, mice-ranger, mixgb, mixgb-sub) can preserve the relationship between variables. The plot also indicates that mice-ranger had larger within- and between-imputation variability. Other types of diagnostic plots, which are omitted here, show similar patterns across all methods.
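The analysis model above can be fitted to each imputed dataset and pooled with Rubin's rules, for example as follows. Here imputed is assumed to be the list of five completed datasets, and age1/age2 are assumed to be pre-computed linear-spline terms for age below and above one year.

```r
# Sketch: Cox model on each imputed NWTS dataset, pooled via mice.
library(survival)
library(mice)

fits <- lapply(imputed, function(d) {
  coxph(Surv(trel, relaps) ~ histol * age1 + histol * age2 +
          stage * tumdiam, data = d)
})
summary(pool(as.mira(fits)))   # Rubin's rules for beta_1, ..., beta_8
```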

Discussion
Standard MI implementations, such as mice-default, do not automatically account for interaction effects among variables in data. Applying tree-based algorithms such as CART and random forests to MI has been shown to alleviate this problem (Doove et al. 2014).
However, tree-based algorithms can be computationally demanding for large datasets.
In this study, we presented the design and imputation performance of mixgb, an MI implementation based on another tree-based algorithm, XGBoost (Chen and Guestrin 2016), together with subsampling and predictive mean matching (Little 1988). In our simulation study, we found that mixgb-sub achieved less biased estimates on average than the other MI implementations. Additionally, mixgb-sub performed best in terms of between-imputation variance. We also found that all four versions of mixgb were slower for small datasets but scaled well for larger datasets, with the GPU versions being the fastest. Subsampling increased the imputation time significantly when utilizing the CPU, especially for large sample sizes. Nevertheless, this increase was negligible when the GPU was used.
To the best of our knowledge, this is the first attempt to implement and evaluate MI using XGBoost with predictive mean matching and subsampling. We show that mixgb and mixgb-sub can achieve imputation performance as good as that of mice-cart, which was previously regarded as the best method for capturing complex data structures (Doove et al. 2014). We consider mice-ranger, mixgb and mixgb-sub to be better alternatives for MI, since CART is susceptible to overfitting and does not generalize well in real-world data scenarios. Moreover, using mice-cart for large-scale imputations can be time-consuming.
For smaller datasets, mice-ranger was faster than mixgb-cpu or mixgb-gpu; however, it was significantly slower for larger datasets. In addition, mice-ranger can take an excessive amount of time to run under certain circumstances, such as the case described in Section 4. Our findings suggest that both mixgb and mixgb-sub are promising automated MI frameworks for large and complex datasets. In this study, we found mixgbsub-gpu to be the best among all evaluated MI methods due to its speed and imputation performance.
Our results may be limited to simulation studies similar to those in this paper and may not generalize to other scenarios. Under our simulation settings, a large proportion of missing values were created via an MAR mechanism that made complete case analysis invalid. If the data is missing completely at random or missingness is only weakly associated with the available variables, then complete case analysis can often achieve better results than MI methods. It is also worth noting that the speed and quality of imputation generally depend on hyperparameter settings, such as the number of trees in mice-ranger (default 10), the number of boosting rounds in mixgb (default 100), and the number of threads used when a multicore processor is available. Future work should evaluate imputation performance with other datasets and a wider range of hyperparameter configurations.

Computation details
The simulation study of imputation performance in Section 3 was run for each method in R 4.0.2 (R Core Team 2022) on an Ubuntu 18.04 server with a 64-core Intel Xeon Gold 6142 processor and 187 GB of RAM, without the use of a GPU. Simulation tasks for all methods were run in parallel.
The evaluation of computational time on real and simulated datasets in Section 4 was run in R 4.2.1 (R Core Team 2022) on a 64-bit Windows 10 PC with 64 GB of RAM, an Intel i7-12700K CPU, and an Nvidia GeForce RTX 3080 Ti GPU. Each imputation method was tested separately to avoid interference between tasks. For more details, please see the file readme.md in the supplementary zip file.