THE EFFECTS OF MISSING DATA CHARACTERISTICS ON THE CHOICE OF IMPUTATION TECHNIQUES

One major characteristic of data is completeness. Missing data is a significant problem in medical datasets. It leads to incorrect classification of patients and is dangerous to the health management of patients. Many factors lead to missing values in medical datasets. In this paper, we propose examining the causes of missing data in a medical dataset to ensure that the right imputation method is used to solve the problem. The mechanism of missingness was studied to identify the missing pattern of the dataset and to determine a suitable imputation technique for generating complete datasets. The pattern shows that the missingness in the dataset used in this study is not monotone. Also, single imputation techniques underestimate variance and ignore relationships among the variables; therefore, we used a multiple imputation technique that runs in five iterations for the imputation of each missing value. All the missing values in the dataset were regenerated (100%). The imputed datasets were validated using an extreme learning machine (ELM) classifier. The results show improvement in the accuracy of the imputed datasets. The work can, however, be extended to compare the accuracy of the imputed datasets with the original dataset using different classifiers such as support vector machine (SVM), radial basis function (RBF), and ELMs.


Introduction
Missing data/values describe the absence of important data items in instances of datasets. Data are collected at various points for medical investigation. Lichman [2013] observed two possible types of databases in the medical domain.
The first type is basically for hospital information systems, which consist of a vast number of attributes. Many such attributes are not directly required for the diagnosis of the patients' ailments. The other type of medical database is collected by experts. These databases may contain unique research data on topics based on hypotheses that must be investigated. Missing data is pervasive in both types of databases, as in many other databases and most real-world data analysis tasks. García-Laencina, Sancho-Gómez, & Figueiras-Vidal [2010] and Tran, Zhang, Andreae, Xue, & Bui [2017] observed that 45% of datasets in the UCI machine learning repository are marred by missing values, most of which may fall in the category of medical data. The pervasive nature of missing data has been described as one of the most challenging problems in data science [Baraldi & Enders, 2010]. The missing observations usually had the potential to be captured but were not, for reasons that may arise from the patient or the medical personnel. In Gao, Liu, Peng, & Jian [2015], two basic patterns of missing data were considered: (a) missing features, a situation in which the features exist but values were not taken, so the information is lost, or the cost of acquiring the feature is high; and (b) missing labels, in which the label is inherently missing, that is, a problem that cannot be avoided, which eventually affects the performance of classifiers in the face of ever-growing datasets [Gimpy, 2014] in this information age. Missing data affects some operations in medical research. It makes it challenging to extract useful information from datasets; also, feature selection is not always applicable to datasets with missing values [Tran et al., 2017].
Missing data is a complex problem that arises from equipment malfunction during data extraction, sampling, transcription, and transmission, and from noise during data pre-processing [Gao et al., 2015]. Newgard & Lewis [2015] observed that missingness of data in clinical research could result from variables that are complex, time-sensitive, or resource-intensive, or from the method of collecting longitudinal data. Other causes of missing values in datasets were observed in [Gimpy, 2014]: (i) the possibility of the observations being irrelevant, especially in medical data collection, because the primary reason for data collection is medical investigation and diagnosis, not research; (ii) inability to record the values when they were collected because of an emergency situation, or avoidance of a response by the respondent for privacy reasons; (iii) the omission of essential features from the data collection plan; and (iv) non-capturing of seemingly available values. Two significant problems envisaged in Zhu [2014] with the presence of missing values in datasets are (i) reduction in overall statistical power and (ii) biased statistical estimation.
The problem of missing data has been debated for some time [Joseph, 2016]. The best way to fix missing values is to revisit the data collection/extraction process to recollect missing values and correct noisy values. This may involve re-investigating patients at various examination units, but such recollection/re-extraction might not be practicable. Therefore, there is a need for techniques that fit close approximations to the real data. The conclusion, however, is that the imputation of missing data has no definite solution [Kenward, 2013].
The accuracy of classification from an incomplete dataset is unreliable because some vital information relevant to the analysis might be lost. Besides, some classifiers find it challenging to run in the face of missing values [Gautam & Ravi, 2015]. Therefore, missing value imputation is necessary. The focus of imputation is to estimate the possible value of the missing data from observed data and to fix the estimated values as replacements for the lost values, that is, to ensure complete datasets.
Several approaches have been used in research for the treatment of missing data in datasets. Some authors used the percentage of missingness in a dataset as a measure for the choice of imputation technique [Joseph, 2016]. Some implementations treat missing data implicitly. This brings about different results when such treatments are replicated using different applications. Although the difference may not be significant, these approaches compromise the scientific soundness of the studies. An explicit approach to handling missing data is a better practice. In Zhu [2014], it was observed that the choice of missing data imputation technique depends on the research focus, whether it is a pragmatic or an analytical approach. Although missing data is pervasive in data science generally, in this paper we focus on the effect of missing values on medical datasets as one application area.
From this point on, the paper is organized as follows: Section 2 reviews relevant work on medical datasets with missing data and some techniques of missing data imputation; Section 3 describes the characteristics of missing values, that is, the mechanisms of missingness; various treatments of missing values in datasets are discussed in Section 4; Section 5 explains the proposed multiple imputation model; Section 6 reports the experimental setup; the results of imputation on the Pima Indians diabetes dataset are discussed in Section 7; while Section 8 concludes the paper.

Review of Literature
Medical datasets have been classified by many researchers. Some of the datasets are complete, while some are incomplete. Several studies have been carried out on missing data and imputation techniques from different perspectives. Some authors explicitly treat missing values using different imputation techniques, while some are passive about it, leading to the assignment of zeros (0), deletion of cases/features, or complete disregard of missing values. ELM has been widely used in recent times to solve classification, clustering, compression, forecasting, and regression problems [Alade, Selamat, & Sallehuddin, 2018] because it tolerates a good number of feature mapping functions such as sigmoid, hard-limit, Gaussian, multiquadric, wavelet, Fourier series, etc., and it handles large and small datasets efficiently [Huang, 2015]. Subbulakshmi & Deepa [2015] proposed a machine learning paradigm that integrates the particle swarm optimization (PSO) technique with extreme learning machines (ELM) to classify some medical datasets. The hybrid system performs well compared to other classifiers; however, missing values in the datasets were substituted with zeros. This approach is scientifically unfit for accurate results, because zeros do not represent a good imputation of missing values. Bai, Mangathayaru, & Rani [2015] gave an overview of the hidden challenges of missing values in medical datasets during pre-processing. They proposed the imputation of missing values in medical datasets with categorical attributes; the causes and pattern of missingness in the datasets were, however, not considered. This may result in the wrong choice of imputation technique.
An extensive review of missing value imputation was carried out in [Armina, Mohd Zain, Ali, & Sallehuddin, 2017]. The authors provide a detailed analysis of various imputation techniques. They grouped imputation techniques into four (4) broad categories: global, local, hybrid, and knowledge-assisted approaches, but no experiment was conducted to test any of the imputation techniques discussed in their study. A Gaussian mixture model combined with extreme learning machines (GMM-ELM) was proposed by Sovilj et al. [2015] as a reliable approximation technique for imputing missing data. Their results improved the imputation of missing values over the mean imputation technique; however, the execution time was longer. Bai et al. [2015] addressed categorical attribute missing values in medical datasets using an imputation measure. The work, however, used a hypothetical dataset with only nine (9) cases and only two (2) missing values; the characteristics of the missingness were not considered in the work. Tsai & Chang [2016] investigated the effects of filtering outliers from datasets on imputation tasks using instance selection on categorical, numerical, and mixed-type attributes. The effectiveness of the method was tested with k-NN and SVM classifiers. To compare the performance of three Bayesian imputation techniques, a Monte Carlo simulation model was used in [Austin & Escobar, 2005] to examine the performance of the sibling models. The results showed that the mean and mean square error of the logistic and Bayesian models depend on the risk factors examined and the mechanism of missing data being used. This gives an insight into the necessity of considering the mechanism of missingness for the right choice of imputation technique. Multiple imputation with the Pohar-Perme method was used in [Falcaro & Carpenter, 2017] to estimate the net survival for stage-specific colorectal cancer. They concluded that datasets with a high percentage of missing values should be interpreted cautiously and with sensitivity analysis. However, the characteristics of the missingness of data in the dataset were not taken into consideration before the choice of the imputation technique.
In Tang, Zhang, Wang, Wang, & Liu [2015], a hybrid imputation method based on the integration of Fuzzy C-Means (FCM) and the Genetic Algorithm (GA) was developed for missing traffic volume data. The study based the estimation on inductance loop detector outputs. Under prevailing traffic conditions, the method performed better than conventional methods. All these methods, and many more in the literature, underscore the need for imputation in datasets with missing values. Although most of the imputation techniques mentioned above attempt to fill the missing values by approximately conforming to the distribution of the datasets, the methods of imputation are not explicitly modeled; therefore, further analysis is ignored [Sovilj et al., 2015], which leads to biased results. Nguyen, Carlin, & Lee [2017] raised some critical points to consider when constructing an imputation model: (a) the functional form of the imputation model, (b) feature selection for the model, (c) inclusion of non-linear relationships in the model, and (d) the best way to handle non-normal continuous features. Nguyen et al. concluded that there is no consensus in the literature on how to implement these decisions; they could be evaluated from the nature of the missing values in the datasets. Therefore, there is a need to know the nature of missingness in a dataset for the right choice of imputation technique. In the next section, we give an overview of the possible nature of missingness in datasets.

Mechanisms of Missingness
The focus of this section is the characteristics of missing values in datasets. These characteristics determine the causes of the missingness in the dataset. It is good to know the cause(s) of missing values in a dataset in order to handle the missingness appropriately [Liu & Gopalakrishnan, 2017]. Some literature refers to these as mechanisms of missingness. Various mechanisms of missing data abound in the literature [Diaconis & Efron, 1983; Falcaro & Carpenter, 2017; Huang & Chen, 2007; Shang & He, 2015; Zhu, 2014], but the three (3) most popular ones are considered for the purpose of this study.

Missing at Random (MAR)
This is a type of missingness that does not occur entirely at random; instead, it occurs where there are other variables with complete information that can account for the missingness. It does not necessarily mean that the incomplete cases are similar to their complete counterparts. MAR is more realistic than missing completely at random (MCAR), and it is the assumption most often applied to missing data imputation in the literature [Newgard & Lewis, 2015]. It is based on an ignorability assumption: that is, the available information is sufficient, and the assignment mechanism can be ignored. This case arises when some respondents decide to hide information about themselves that is personal or unpopular [Wasito & Mirkin, 2006]. Logistic regression with an outcome of 1 for the observed and 0 for the missing values is a reasonable option for its treatment. It can be statistically expressed as in (1): for a random attribute X and a predictor attribute Z, if

$$P(X \text{ is missing} \mid X, Z) = P(X \text{ is missing} \mid Z) \qquad (1)$$

then, given Z, the missingness of X is not affected by the values $x \in X$. That is, when the missingness is based on the observed factors, it is independent of the unobserved factors. Although MAR is popularly accepted in many techniques, simple imputation techniques still yield biased or imprecise results under it [Newgard & Lewis, 2015].

Missing Completely at Random (MCAR)
MCAR occurs if the cause of missing values of observable features and the parameters of unobservable features of interest are independent, and their occurrence is entirely at random. Analyses performed on MCAR datasets are unbiased, although this type of dataset is rare. It is the highest level of randomness. It is expressed as in (2):

$$P(X \text{ is missing} \mid X, Z) = P(X \text{ is missing}) \qquad (2)$$

Any imputation technique can be applied [Tsai & Chang, 2016].

Missing Not at Random (MNAR)
MNAR is a type of missing data where there is a relationship between the missing data and the reason for the missingness. It occurs when the missingness depends on the probability of the actual value of the missing data [Gimpy, 2014] and some/all other observed data [Zhu, 2014] [Austin & Escobar, 2005]. Tsai & Chang [2016] observed that this mechanism would be difficult to judge because the missing data are unknown.
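The three mechanisms can be contrasted with a small simulation. The sketch below is illustrative only and is not part of the study: the function name `make_missing`, the median-based thresholds, and the default rate are assumptions chosen to make the difference visible. Here z stands for a fully observed predictor and x for the attribute that may go missing.

```python
import random

def make_missing(rows, mechanism, rate=0.3, seed=0):
    """Return a copy of rows [(z, x), ...] with some x values set to None.

    z is a fully observed predictor; x is the attribute that may go missing.
    The thresholds and rates are illustrative assumptions, not from the paper.
    """
    rng = random.Random(seed)
    z_median = sorted(z for z, _ in rows)[len(rows) // 2]
    x_median = sorted(x for _, x in rows)[len(rows) // 2]
    out = []
    for z, x in rows:
        if mechanism == "MCAR":
            p = rate                                # same chance for every row
        elif mechanism == "MAR":
            p = 2 * rate if z > z_median else 0.0   # depends only on observed z
        elif mechanism == "MNAR":
            p = 2 * rate if x > x_median else 0.0   # depends on the value of x itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        out.append((z, None if rng.random() < p else x))
    return out
```

Under MAR the rows that lose x can be predicted from z alone, whereas under MNAR the lost values are systematically the larger ones, which is why the observed data cannot reveal the mechanism on their own.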
The treatments of missing data should be based on the mechanism of missing data, as explained in the next section.

Treatments of Missingness
In the treatment of missing data, two broad approaches are conventional in the literature: (a) omission of missing data and (b) imputation of the missing data [Sovilj et al., 2015]. Some approaches to the treatment of missing values (MV) are outlined in Fig. 1 and discussed below.

Omission
This approach simply deletes instances with missing data. The approach is common in some regression models and is usually referred to as listwise deletion or complete case analysis [Mukaka et al., 2016]. It is only valid under the following conditions: (a) the instances with missing values in the sample are negligible; (b) the pattern of the missing data is missing at random (MAR) or missing completely at random (MCAR) [Gautam & Ravi, 2015]. This approach reduces the sample size, so vital information is often lost, because the deleted instances may be essential, deciding factors for predictions and classifications [Sovilj et al., 2015]. It limits the power of the study. Therefore, imputation is preferred to listwise deletion.
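As a minimal sketch of the omission approach, the snippet below drops every instance that contains a missing value. The function name `listwise_delete` and the toy rows are hypothetical; `None` is assumed to mark a missing value.

```python
def listwise_delete(rows):
    """Keep only instances (rows) that have no missing value (None)."""
    return [row for row in rows if all(v is not None for v in row)]

# Toy rows of (glucose, insulin, bmi); values are illustrative only.
toy = [
    (148, None, 33.6),
    (89, 94, 28.1),
    (137, 168, 43.1),
    (116, None, None),
]
complete = listwise_delete(toy)
print(len(toy), "instances before,", len(complete), "after listwise deletion")
```

Half of the toy instances are discarded even though most of their cells are observed, which illustrates the loss of information described above.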

Imputation Techniques
Imputation can be categorized into two: (a) single imputation and (b) multiple imputation.

Single Imputation
Single imputation uses the mean, median, mode, or a conditional mean, such as a predicted value from a regression function or a decision tree [Tran et al., 2017; Wasito & Mirkin, 2006], to generate the data to be imputed only once. Mean imputation is a standard method used to replace missing values [Tran et al., 2017]. The average or median of the observed feature values is computed and substituted for each missing value for numerical attributes, and the mode for categorical ones. Unfortunately, this method does not preserve the actual distribution of the features. It underestimates the variance and ignores relationships among variables in the dataset [Austin & Escobar, 2005; Falcaro & Carpenter, 2017]. These lead to complications in statistical inference. Imputation of missing data is handled sequentially, one by one. This method works with datasets with a limited number of variables [Wasito & Mirkin, 2006]. The last observation carried forward: this method replaces every missing value with the last observation. This approach assumes that the result will not change after the last reading [Zhu, 2014]. It is a simple method, so it is accessible. It maintains the actual size of the data; however, it might result in biased outcomes. This method is not analytical enough, so there is a need for a more comprehensive imputation technique.
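The variance shrinkage caused by mean imputation, and the behaviour of last observation carried forward, can be seen on a toy feature. The sketch below is only an illustration; the function names `mean_impute` and `locf_impute` and the numbers are hypothetical, and `None` is assumed to mark a missing value.

```python
from statistics import mean, variance

def mean_impute(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]

def locf_impute(values):
    """Replace each missing value with the last observed value before it."""
    out, last = [], None
    for v in values:
        if v is not None:
            last = v
        out.append(last)  # stays None when nothing has been observed yet
    return out

toy = [10, None, 14, None, 18, 22]
observed = [v for v in toy if v is not None]
print("variance of observed values:", variance(observed))
print("variance after mean imputation:", variance(mean_impute(toy)))
print("after LOCF:", locf_impute(toy))
```

Every imputed cell sits exactly at the mean, so the sample variance of the completed feature is smaller than that of the observed values, while the mean itself is unchanged.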

Multiple Imputation Techniques
This method is scientifically plausible: it replaces missing values with the modeled results of several imputations that analytically represent the missing data. For example, a regression model will reflect the uncertainty of the regression coefficients and the sample variables in the model. Multiple imputation techniques analytically create several values to replace the missing data [Gelman & Hill, 2007]. Different models can also predict these replacement values. It must be noted that the aim of multiple imputation is not to produce the actual missing value; rather, it attempts to generate scientifically valid results that account for the missing values [Zhang, 2016]. According to Rodwell, Lee, Romaniuk, & Carlin [2014], it is possible that the simulated values may not fall within the expected range. The single idea behind multiple imputation, however, is to form complete datasets from the observed values analytically.
N is the number of imputations carried out on the original dataset with missing values, and it produces N different complete datasets. In Armina et al. [2017], further details on local, global, hybrid, and knowledge-assisted imputation techniques are discussed, as sketched in Fig. 1.
In the next section, we propose a regression model for multiple imputations used in this study.

Proposed Multiple Imputation Model
In this section, a regression model is proposed for multiple imputation of missing data, because it draws values randomly from donor instances to predict values that are close to the missing values to be predicted; regression also creates inter-data variability using stochastic elements [van Kuijk, Viechtbauer, Peeters, & Smits, 2016]. The results of this data pooling produce correct standard error estimates.
For a dataset with a missing data problem of $Y = X_{m+1}$ on m variables $X_1, \ldots, X_m$, with instances of a random sample in which some values are incomplete, the incomplete values can be estimated with the regression model in (3):

$$\hat{Y} = \beta_0 + \beta_1 X_1 + \cdots + \beta_m X_m \qquad (3)$$

where $\beta = (\beta_1, \ldots, \beta_m)$ is the vector of regression coefficients, which gives the relative size of the coefficients to one another, and $\beta_0$ is the intercept.
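To make the idea of regression-based multiple imputation concrete, the sketch below imputes one incomplete feature from a single complete predictor. It is a simplified illustration under stated assumptions, not the procedure used in the study: it uses one predictor instead of m, adds normally distributed residual noise as the stochastic element, and does not redraw the regression coefficients between imputations as a full multiple imputation procedure would. The function names are hypothetical, and at least three observed pairs are assumed.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1, residual_sd)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return b0, b1, (sse / (n - 2)) ** 0.5

def impute_once(xs, ys, rng):
    """Fill each missing y (None) with the regression prediction plus random noise."""
    obs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    b0, b1, sd = fit_line([x for x, _ in obs], [y for _, y in obs])
    return [y if y is not None else b0 + b1 * x + rng.gauss(0, sd)
            for x, y in zip(xs, ys)]

def multiple_impute(xs, ys, n_imputations=5, seed=0):
    """Return n_imputations completed copies of ys; observed values are untouched."""
    rng = random.Random(seed)
    return [impute_once(xs, ys, rng) for _ in range(n_imputations)]
```

Each completed copy keeps the observed values and differs only in the imputed cells, so the spread across copies reflects the uncertainty about the missing values.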
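To make an inference on the regression coefficients, the estimates $\hat{\beta}_1, \ldots, \hat{\beta}_N$ for the missing values and their standard errors $s_1, \ldots, s_N$ are obtained for each of the N imputed datasets. The mean estimate over the datasets ($\bar{\beta}$) is given in (4):

$$\bar{\beta} = \frac{1}{N} \sum_{i=1}^{N} \hat{\beta}_i \qquad (4)$$

The variation within (V) and between (W) the imputations is shown in (5):

$$V = \frac{1}{N} \sum_{i=1}^{N} s_i^2, \qquad W = \frac{1}{N-1} \sum_{i=1}^{N} \left(\hat{\beta}_i - \bar{\beta}\right)^2 \qquad (5)$$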
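The pooling step can be sketched with the standard combining rules for multiple imputation (Rubin's rules), which we assume here to be what (4) and (5) express. The function name `pool_estimates` is hypothetical, and the total variance line is the usual textbook rule rather than something stated in this paper. The estimates in the example are the five completed-dataset means reported for V2 in this paper; the standard errors are placeholders, not results.

```python
def pool_estimates(estimates, std_errors):
    """Pool N per-dataset estimates; returns (pooled, within, between, total)."""
    n = len(estimates)
    pooled = sum(estimates) / n                          # mean estimate
    within = sum(se ** 2 for se in std_errors) / n       # within-imputation variance (V)
    between = sum((b - pooled) ** 2 for b in estimates) / (n - 1)  # between-imputation variance (W)
    total = within + (1 + 1 / n) * between               # standard total variance (assumption)
    return pooled, within, between, total

v2_means = [121.70, 121.67, 121.84, 121.62, 121.77]  # completed-dataset means for V2
placeholder_se = [1.0] * 5                           # hypothetical, not from the paper
pooled, within, between, total = pool_estimates(v2_means, placeholder_se)
print(round(pooled, 2), round(between, 5))
```

Pooling the five V2 means gives about 121.72, and the small between-imputation variance (about 0.0075) is consistent with the observation that the completed datasets have stable means.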

Experimental Setup
In order to determine the choice of imputation technique for a dataset, the Pima Indians diabetes dataset from UCI was used. The Pima Indians diabetes dataset contains clinical tests and diagnoses of Pima Indian women aged 21 years and above with diabetes [Smith, Everhart, Dickson, Knowler, & Johannes, 1988; Strack et al., 2014]. The dataset is made up of integer and real number data types. It has 768 instances, 8 predictive features, and a class feature. The features in the dataset are number of times pregnant (V1), plasma glucose concentration in a 2-hour oral glucose tolerance test (V2), diastolic blood pressure (V3), triceps skinfold thickness (V4), 2-hour serum insulin (V5), body mass index (V6), diabetes pedigree function (V7), and age (V8). The class (V9) is binary, with 1 for positive and 0 for negative. The dataset was indicated to have missing values on the web page [Smith, Everhart, Dickson, et al., 1988], but the actual dataset did not show such signs. However, a critical examination of the dataset shows that variables such as plasma glucose concentration, body mass index, triceps skinfold thickness, diastolic blood pressure, and 2-hour serum insulin cannot be 0 for any instance; therefore, it was assumed that all the data values scored as 0 are really missing values, and they were so treated in this study.
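The recoding of zeros as missing values can be sketched as follows. This is a minimal illustration, assuming each instance is a tuple in the order V1 to V9 and that `None` is used as the missing marker; the function name `recode_zeros` and the sample rows are hypothetical. Zeros in V1 (number of times pregnant) and in the class label are left untouched because they are legitimate values.

```python
# 0-based positions of V2..V6, the features in which a 0 is treated as missing.
ZERO_MEANS_MISSING = (1, 2, 3, 4, 5)

def recode_zeros(rows):
    """Replace physiologically impossible zeros in V2..V6 with None."""
    return [
        tuple(None if i in ZERO_MEANS_MISSING and v == 0 else v
              for i, v in enumerate(row))
        for row in rows
    ]

# Illustrative rows in the order V1..V9.
sample = [
    (0, 137, 40, 35, 168, 43.1, 2.288, 33, 1),
    (6, 148, 72, 35, 0, 33.6, 0.627, 50, 0),
]
recoded = recode_zeros(sample)
print(recoded[1])
```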
The missing values in the dataset were coded for proper identification by the imputation program. For an analytical presentation of the degree of missingness in the dataset, graphical and numerical summaries [Nguyen et al., 2017] were used in our report. The percentage of missingness was calculated by instances, features, and values, as shown in Fig. 2.
The percentage of missingness among the instances was considered in this study, as is common in the literature [Austin & Escobar, 2005; Tsai & Chang, 2016]. The missing values were analyzed in order to know the distribution of missingness among the various features that have missing values in the dataset. The distribution is shown in Table 1. A missing pattern describes a set of features for which at least one instance in the dataset has missing value(s) in the same feature(s) as the pattern [Tran et al., 2017].
The pattern of missingness of the features in our dataset was analyzed. The result of the analysis is shown in Fig. 3. Each pattern corresponds to the group of instances with the same pattern of incomplete and complete data.
The variables are sorted in increasing order of missing values (from the feature with the fewest missing values to the one with the most) from left to right. This enables us to know the type of imputation required to fill in the missing values. Multiple imputation is carried out on the original dataset to produce five complete datasets. The analysis of the imputed datasets for features with missing values is shown in Tables 2-6.
As mentioned earlier, the original Pima Indians diabetes dataset could not easily be noticed as having missing data because every cell in the dataset is completely scored. However, the source (UCI database) categorically stated that the dataset has missing values.

On a critical look, it was observed that the missing values in the dataset were scored zero, and they were so treated, except for the first feature (number of times pregnant, V1), which can legitimately be zero among the selected women. Fig. 2 shows that 44.44% of the features in the dataset are complete; that is, 4 out of 9 variables (class label inclusive) have complete data, while 55.56% (5 out of 9) have incomplete (missing) values. Based on the instances (cases) in the dataset, 51.04% (392 cases) have complete data, while 48.96% (376 cases) have incomplete data. For all the values in the Pima Indians diabetes dataset (that is, the cells at the intersections of features and instances), 90.57% of the cells have values while 9.43% are missing.
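The three views of missingness summarized in Fig. 2 (by features, by instances, and by values) can be computed as in the sketch below. The function name `missingness_summary` and the toy rows are hypothetical, and `None` is assumed to mark a missing value. For example, 5 incomplete features out of 9 gives 100 × 5 / 9 ≈ 55.56%.

```python
def missingness_summary(rows):
    """Percentage of incomplete features, incomplete instances, and missing cells."""
    n_rows, n_cols = len(rows), len(rows[0])
    incomplete_features = sum(
        any(row[j] is None for row in rows) for j in range(n_cols))
    incomplete_instances = sum(any(v is None for v in row) for row in rows)
    missing_values = sum(v is None for row in rows for v in row)

    def pct(k, n):
        return round(100.0 * k / n, 2)

    return {
        "features": pct(incomplete_features, n_cols),
        "instances": pct(incomplete_instances, n_rows),
        "values": pct(missing_values, n_rows * n_cols),
    }

toy = [(1, None, 3), (4, 5, 6), (7, None, None), (1, 2, 3)]
print(missingness_summary(toy))
```

The three percentages answer different questions, which is why a dataset can have about half of its instances incomplete while fewer than a tenth of its cells are missing.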
In Table 1, the distribution of missing values among the cases is presented. The table shows the five features with missing values, with their quantities and percentages across the dataset. The features are arranged in descending order of their percentage of missingness: 2-hour serum insulin (V5) has the highest percentage of missing values (48.7%), triceps skinfold thickness (V4) has 29.6%, diastolic blood pressure (V3) has 4.6%, while body mass index (V6) and plasma glucose concentration (V2) have 1.4% and 0.7%, respectively. Number of times pregnant (V1), diabetes pedigree function (V7), and age (V8) have no missing values. Also, the numbers of valid values are shown along with their means and standard deviations for all the features with missing values. Fig. 3 depicts the missing pattern of the features. The features are arranged from left to right, with features without missing values on the left, followed by those with the fewest missing values, through to the ones with the most missing values on the right. The missing pattern chart displays the value patterns for the analysis function. Each pattern represents the group of instances that have the same pattern of incomplete and complete data. In Fig. 3, Pattern 1 corresponds to instances with no missing value; Pattern 2 shows instances that have missing values on V2; Pattern 3 represents instances with missing values on V6; Pattern 4 represents instances with missing values on V5; Pattern 5 shows instances with missing values on V2 and V5; Pattern 6 represents missing values on V6 and V5; Pattern 7 is for missing values on V3 and V5; Pattern 8 represents instances with missing values on V4 and V5; Pattern 9 represents instances with missing values on V6, V4, and V5; Pattern 10 is for missing values on V3, V4, and V5; and Pattern 11 is for missing values on V6, V3, V4, and V5. All the patterns show no missing values in V1, V7, V8, and V9.
Although the dataset has the potential for $2^9$ patterns [Lichman, 2013], only 11 feasible patterns are represented in the 768 instances.
The features and patterns are arranged in an orderly manner to reveal whether monotonicity exists in the dataset. From the patterns in Fig. 3, it is clear that the missingness in the dataset is non-monotone because the missing cells and non-missing cells are not all contiguous; that is, the dataset is MAR, as explained in Section 3. Many values would have to be imputed to achieve monotonicity. Therefore, the use of a monotone (single) method of imputation may not be plausible; the use of multiple imputation techniques (in Section 4iii) becomes the needed option.
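The pattern analysis and the monotonicity check can be sketched as below. The sketch assumes the usual definition of a monotone pattern: the features can be ordered so that, in every instance, once one feature is missing all later features are missing too. The function names and the toy rows are hypothetical, and `None` is assumed to mark a missing value. Sorting the features by their number of missing values is sufficient to find such an ordering when one exists.

```python
from collections import Counter

def missing_patterns(rows):
    """Count instances per pattern; a pattern is a tuple of flags (True = missing)."""
    return Counter(tuple(v is None for v in row) for row in rows)

def is_monotone(rows):
    """True if the features can be ordered so that missing cells are contiguous."""
    n_cols = len(rows[0])
    n_missing = [sum(row[j] is None for row in rows) for j in range(n_cols)]
    order = sorted(range(n_cols), key=lambda j: n_missing[j])
    for row in rows:
        seen_missing = False
        for j in order:
            if row[j] is None:
                seen_missing = True
            elif seen_missing:
                return False  # an observed cell follows a missing one
    return True

monotone_toy = [(1, 2, 3), (1, 2, None), (1, None, None)]
mixed_toy = [(1, None, 3), (1, 2, None)]
print(is_monotone(monotone_toy), is_monotone(mixed_toy))
```

By this definition, a pattern such as Pattern 2 in Fig. 3, where V2 is missing while V5 (the feature with the most missing values) is observed, is already enough to make the dataset non-monotone.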
Multiple imputation was carried out on the dataset features with missing data using (4). The five imputed features were V2, V3, V4, V5, and V6. The order of imputation of the features was V2, V6, V3, V4, and V5 (in increasing order of the percentage of missingness). The imputation was complete for all missing values in each of the features, and none was omitted, either as a result of 'too' many missing values or no missing value. The descriptive analysis of the imputation on each feature is shown in Tables 2-6. Table 2 shows multiple imputation for feature V2. Five (5) out of 768 instances in the feature are missing. The missing values were imputed in 5 iterations, which is a total of 25 runs. The missing values were 100% imputed. The mean, standard deviation, minimum, and maximum values for the original dataset and each imputation run are shown in their respective columns. The same treatment is done for V3, V4, V5, and V6 in Tables 3-6, respectively, in the appendix. Observing the characteristics of the datasets before and after the imputations from Tables 2-6, the statistical mean, standard deviation, minimum, and maximum are more stable (with less variation) after the imputation than in the reduced datasets during imputation. For example, during imputation in Table 2, the means of the imputed values for the five iterations are 125.52, 119.22, 145.21, 111.99, and 135.09, while the means of the five completed datasets are 121.70, 121.67, 121.84, 121.62, and 121.77, which are close to that of the original data. This shows that the imputed datasets are more reliable than the original form, especially in medical diagnosis, which deals with the issue of saving lives.
The results of the simulated datasets are shown in Fig. 4. The results of the ELM classification of the complete and incomplete datasets validate the multiple imputation of the Pima Indians diabetes dataset. P0, the original incomplete dataset, has the lowest accuracy of 63.0431%, while all five imputed datasets (P1-P5) perform better than P0. This indicates that multiple imputation is a better choice of imputation technique for a dataset with non-monotone missingness for better classification accuracy.

Conclusion
This paper proposed the examination of the effect of the characteristics of missing data on the choice of imputation technique. The mechanism of missingness and the missing pattern were examined as the bases for the choice of imputation technique for filling missing data in a medical dataset. In this study, we considered the various mechanisms of missing values: MAR, MCAR, and MNAR. Different treatments of missing data based on the mechanisms of missingness were discussed. We backed up our study with investigations into the pattern of missingness in the Pima Indians diabetes dataset. The observed pattern of missingness in the dataset showed that multiple imputation is more suitable for imputing the missing values because it reflects the uncertainty of the underlying missing data, and the imputations were so treated, as shown in Tables 2-6. Our further research will focus on the performance comparison of different classifiers on the imputed datasets and on a suitable optimization technique for a favored classifier in order to improve the accuracy of classification. The work can, however, be extended to compare the accuracy of the imputed datasets with the original dataset using different classifiers such as support vector machine (SVM), radial basis function (RBF), and extreme learning machines (ELM).

Accepted manuscript to appear in VJCS, Vietnam J. Comp. Sci.