Iterative Robust Semi-Supervised Missing Data Imputation



I. INTRODUCTION
In many real-world applications, scientists are often confronted with the problem of incomplete datasets [1]. This phenomenon is particularly pronounced in medical, clinical, industrial and survey data [2]-[4]. Incompleteness or deficiency [3] refers to the presence of missing values in one or more attributes of a dataset due to a variety of reasons, including manual data entry mistakes, equipment faults, device failure, inaccurate measurements during data collection, accidental deletion, non-response, admission limitations, unwillingness to provide personal information and so on [5], [6]. The analysis of datasets with missing values in attributes is infeasible in most cases, since conventional data methods are not directly applicable [3]. Even if such a method is workable, it inevitably results in inaccurate learning models and erroneous results [7]. Moreover, knowledge quality and data quality are inextricably linked. Therefore, poor data quality has a negative effect on both descriptive and predictive statistics [8].
The associate editor coordinating the review of this manuscript and approving it for publication was Kezhi Li.
A very interesting aspect of missing values analysis concerns the mechanisms which result in missing data. The so-called ''missingness mechanisms'' [9] describe dependencies or non-dependencies between the distribution of an instance having one or more missing values and the distributions of observed and missing data [10]. Moreover, these mechanisms have a material impact on selecting the proper method for handling the missing data which occur in many real-world datasets [11]. We mainly consider three different assumptions of missing data [7]: Missing at Random (MAR), Missing Completely at Random (MCAR) and Missing not at Random (MNAR). MAR means that the probability of a single attribute value missing depends on the values of the observed data but not on the missing ones. In this case, the distributions of the observed and missing data may differ. Data are MCAR when the probability of a single attribute value missing is contingent neither on the values of the observed data nor upon the missing ones. Essentially, MCAR forms a special case of MAR [12]. In this case, the distribution of the observed data is the same as the distribution of the missing values [2]. Finally, data are MNAR if the probability of a single attribute value being missing depends on the missing data. Practically, if data are neither MAR nor MCAR, then they are deemed to be MNAR [13].
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Effectively addressing the challenge of missing values is an essential step of the data mining process. Missing values can be handled through specific strategies such as deletion of partially recorded instances or supplementation with potential or actual values [12], a method referred to as imputation [9]. Deleting instances with missing values is a fairly straightforward approach which, however, results in the loss of valuable information [9]. Unlike deletion, imputation is a very common approach to overcome the shortcomings caused by missing data during the preprocessing stage of a data mining task. Therefore, a plethora of statistical and machine learning methods have been proposed with a view to imputing missing values in incomplete datasets according to a specific technique [14].
Machine learning based imputation methods have been shown to be very effective for addressing the missing values phenomenon in recent years. The prevalent idea behind these methods is to train a classification or regression model based upon observed data and subsequently apply it to predict all missing values of the dataset attributes [10]. A plethora of familiar supervised and unsupervised methods including a variety of classification, regression and clustering techniques have been effectively used in a wide range of studies, as easily identified in the pertinent literature [15].
From that perspective, the main objective of this study is to propose an imputation method integrating the semi-supervised learning (SSL) approach, which is also known as ''weakly'' or ''incomplete supervised learning'' [16]. SSL methods have proved to be particularly effective for exploiting a small pool of labeled examples together with a large pool of unlabeled ones to improve learning performance.
In this context, unlabeled examples may be considered as a form of incomplete or partially labeled instances [17] as regards the missing values of the target attribute (Fig. 1). Although SSL methods have been widely used for solving a variety of data mining problems, there is no similar work in the imputation field. The proposed algorithm, which we call Iterative Robust Semi-Supervised based Imputation (IRSSI), is a new hybrid imputation method based on robust semi-supervised ensembles, thereby harnessing the benefits of SSL. The experimental results on real-world benchmark datasets and artificially generated datasets, with respect to different and high ratios of missing data, demonstrate the efficiency of the IRSSI algorithm compared to familiar imputation methods. Since SSL methods have not yet been used in the imputation process, we consider our proposal as an initial, yet promising step in this direction.
The rest of this paper is structured as follows: In Section II, we discuss several methods for handling incomplete data, especially those concerning imputation methods. The proposed algorithm is described in detail in Section III, while the experimental procedure and the corresponding results are presented and analyzed in Section IV. Finally, the study concludes considering some thoughts for future directions.

II. METHODS FOR HANDLING MISSING DATA
A plethora of methods have been proposed to tackle the incompleteness problem, each one having its own advantages and disadvantages [18]. These methods can be grouped into two main categories: deletion and imputation.

A. DELETION METHODS
Dropping attributes or instances from incomplete data is considered to be a naive and convenient method for handling missing values, especially in the case where data are missing completely at random [10]. In the following paragraphs, we discuss the two major types of deletion methods: complete-case analysis and available-case analysis.

1) COMPLETE-CASE ANALYSIS
A very simple and commonly used method for handling incomplete data is to delete all instances having one or more missing values. Complete-case analysis (CCA) or case-wise deletion [11] is indeed a preferable and quite effective method for data analysts, especially in cases where missing data constitutes a small part of the whole dataset [19]. A generally accepted rule is to omit all incomplete instances if data is missing for less than a predefined threshold, for example 5% [20]. At this point, we should emphasize the following extreme case: Suppose that each one of the n instances of a dataset with m attributes has only one missing value, while all missing values correspond to different attributes. This implies the deletion of the whole dataset, even though the missing ratio is only 1/m. Nevertheless, although CCA yields a fully observed dataset which is available for further data analysis, it presents significant shortcomings that clearly affect its effectiveness. The first one is the loss of information, since it directly results in a subset of the initial dataset [9]. The second one is that the remaining subset is no longer representative of the parent population [12], while a bias is produced in the model if the missingness mechanism is not MCAR [10].
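The extreme case described above can be checked on a small example. The following is a minimal sketch, assuming a hypothetical toy dataset with n = m = 4 in which missing cells are encoded as `None`; the helper names `missing_ratio` and `complete_cases` are illustrative only.

```python
# Hypothetical toy dataset: n = 4 instances, m = 4 attributes,
# exactly one missing value (None) per instance, each in a different attribute.
data = [
    [None, 2.0, 3.0, 4.0],
    [1.0, None, 3.0, 4.0],
    [1.0, 2.0, None, 4.0],
    [1.0, 2.0, 3.0, None],
]


def missing_ratio(rows):
    """Return the fraction of cells that are missing."""
    total = sum(len(row) for row in rows)
    missing = sum(value is None for row in rows for value in row)
    return missing / total


def complete_cases(rows):
    """Complete-case analysis: keep only the rows without any missing value."""
    return [row for row in rows if all(value is not None for value in row)]


ratio = missing_ratio(data)   # 4 missing cells out of 16, i.e. 1/m
kept = complete_cases(data)   # every row has a missing cell, so nothing is kept
```

Although only a quarter of the cells are missing here, case-wise deletion discards every instance.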

2) AVAILABLE-CASE ANALYSIS
Available-case analysis (ACA) or pair-wise deletion is another familiar deletion method which exploits instances with missing values in a flexible manner. More specifically, instead of dropping an incomplete instance, the specific instance is used for analyzing the remaining attributes with non-missing values, thereby utilizing all available information [7]. In this case, different analyses of data are produced based on different subsamples of instances which depend only on the attributes employed each time. A major disadvantage of this method is that it also leads to biased estimates if the missingness mechanism is not MCAR [11].

B. IMPUTATION METHODS
The term ''imputation'' refers to the process of replacing the missing values of instances in a given incomplete dataset with their potential or actual values according to a specific strategy [12]. Therefore, several statistical and machine learning methods have been proposed and employed with a view to approximating the missing values occurring in incomplete datasets as effectively as possible [15]. Regardless of the method used, imputation is considered both an essential and sensitive step of data preprocessing [10], which clearly affects the performance of the data mining task [9]. Imputation methods are usually categorized into three main classes: single, machine-learning based and multiple (Fig. 2).

1) SINGLE IMPUTATION
Single or univariate imputation [21] deals with methods for replacing each missing value of an attribute with a single imputed one [9]. Four commonly used single imputation methods are: mean and mode, regression, hot deck and expectation-maximization.

a: MEAN AND MODE IMPUTATION
A particularly efficient and widely used statistical approach for replacing missing values of numerical or categorical attributes is the mean and mode imputation technique [22]. According to the mean approach, the missing values of a single numerical attribute are replaced with the arithmetic mean of the observed values of that attribute, while the mode approach fills in the missing values of a discrete or categorical attribute with the most frequent observed value (i.e. the mode of the attribute values) [6]. A slightly different approach is to replace missing values through the mean and mode approach based solely on the instances with the same output class, known as concept average value and concept most common value respectively [23]. Note that, in both approaches, missing values are filled in with estimated ones, which inevitably introduces an additional bias [7], especially when data are not MCAR.
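As an illustration of the technique, the following minimal sketch fills a single attribute with its mean or mode. It assumes that missing cells are encoded as `None` and that the attribute has at least one observed value; the function name `impute_mean_mode` is hypothetical.

```python
from statistics import mean, mode


def impute_mean_mode(column, categorical=False):
    """Replace None entries with the mean (numerical) or mode (categorical)
    of the observed values of the same attribute."""
    observed = [value for value in column if value is not None]
    fill = mode(observed) if categorical else mean(observed)
    return [fill if value is None else value for value in column]


numeric = impute_mean_mode([1.0, None, 3.0])
nominal = impute_mean_mode(["a", "b", None, "a"], categorical=True)
```

The class-conditional variants (concept average value and concept most common value) would apply the same function separately to the instances of each output class.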

b: REGRESSION IMPUTATION
In accordance with this method, a regression model is built from the observed data of a specific instance and is subsequently used to predict the missing values of that instance. The regression model (linear or non-linear) that best fits the data depends on the nature of the relationships between attributes [20]. Linear regression is usually applied for estimating the missing values of numerical attributes, while logistic regression or multinomial logistic regression is usually used for estimating the missing values of categorical ones [15].
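For the numerical case, the idea can be sketched with a single predictor and an ordinary least-squares line. This is an illustrative simplification, assuming that the predictor `x` is fully observed and not constant, and that missing target cells are encoded as `None`; the function name `regression_impute` is hypothetical.

```python
def regression_impute(x, y):
    """Fill None entries of y using a least-squares line fitted on the
    pairs (x, y) for which y is observed."""
    pairs = [(a, b) for a, b in zip(x, y) if b is not None]
    mean_x = sum(a for a, _ in pairs) / len(pairs)
    mean_y = sum(b for _, b in pairs) / len(pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Observed values are kept; missing ones receive the fitted value.
    return [intercept + slope * a if b is None else b for a, b in zip(x, y)]


filled = regression_impute([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, None, 8.0])
```

With several predictors, the same pattern applies with a multivariate regression model in place of the fitted line.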

c: HOT DECK IMPUTATION
Hot deck imputation is based on similar but complete instances of data for replacing the missing values of incomplete ones [20]. A considerable advantage of hot deck imputation is that it does not alter the distribution of observed data after the imputation process, unlike mean and mode imputation [9]. A very similar approach, namely cold deck imputation, is to make use of similar complete data coming from an external source [9].

d: EXPECTATION-MAXIMIZATION
The Expectation-Maximization (EM) algorithm is an iterative method for imputing missing values in incomplete numerical datasets, originally introduced by Dempster et al. [24]. The concept behind the EM algorithm is ''impute, estimate and iterate until convergence'' [9]. Each iteration consists of two steps: expectation and maximization. The expectation step concerns the estimation of missing values given the observed data, while, in the maximization step, the current estimated values are used to maximize the likelihood of all the data. The estimated values are updated, the two steps are iterated until convergence of the maximum likelihood of data, and the final estimates are used as the imputation values [15].

2) MACHINE-LEARNING BASED IMPUTATION
Machine learning-based imputation [25] concerns the process of building a predictive learning model based on the observed data for estimating the values of the missing ones [20]. In particular, the attribute with missing values is considered to be the target attribute, while the remaining ones are used to train a learning algorithm which is subsequently used to predict the unknown missing values [22]. According to [15], clustering, k-Nearest Neighbors (k-NN), decision trees and Random Forests are the top four machine learning-based imputation methods, while a plethora of machine learning-based imputation methods are presented and discussed in [10], falling into five main categories: clustering, k-NN, decision trees, support vector machines and Artificial Neural Networks imputation methods. A short overview of some of the most characteristic machine learning imputation methods is presented below.

a: k-NN IMPUTATION
This is a simple and quite effective similarity-based imputation approach which relies on the k-NN technique [8]. For each missing value of a specific instance, the k most similar instances are selected according to the shared non-missing values and a predefined similarity measure (e.g. Euclidean distance, Manhattan distance or Minkowski norm). For categorical attributes, the imputed value is the most common among the k most similar instances, while the average value is used for numerical ones [10]. A slightly modified approach is the weighted k-NN method [26], which weights the contributions of the neighbors (weighted average) on the basis of a similarity measure. Obviously, both k-NN approaches are typical hot deck methods [20], and their effectiveness depends on the appropriate selection of the k parameter, which is often empirically selected [27].
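The k-NN approach for numerical attributes can be sketched as follows. This is a minimal illustration under stated assumptions: missing cells are encoded as `None`, similarity is the Euclidean distance over shared non-missing attributes (normalized by their number), and the function name `knn_impute` is hypothetical.

```python
import math


def knn_impute(rows, k=2):
    """Fill each None cell with the mean of that attribute over the k most
    similar rows in which the attribute is observed."""

    def distance(a, b):
        # Compare only the attributes observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = [
                (distance(row, other), other[j])
                for t, other in enumerate(rows)
                if t != i and other[j] is not None
            ]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:  # leave the cell missing if no donor exists
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed


toy = [[1.0, 1.0], [1.1, 1.2], [5.0, 5.0], [1.0, None]]
filled = knn_impute(toy, k=2)
```

For a categorical attribute, the mean over the neighbors would be replaced by their most common value; the weighted variant would replace the plain mean with a similarity-weighted average.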

b: DECISION TREES IMPUTATION
Decision trees form a commonly used supervised approach for imputing missing values in attributes. In general, a classification or regression tree is built for each attribute trained on observed data, which is subsequently used for estimating the missing values of a particular attribute [28].

c: CLUSTERING IMPUTATION
Typical clustering methods, such as k-means, hierarchical clustering [29] and k-means clustering with weighted distance [30] have been generally employed to improve the imputation performance in incomplete datasets. However, clustering methods are not robust enough to missing data [26].

3) MULTIPLE IMPUTATION
Multiple imputation or multivariate imputation [21] or repeated imputation [4] deals with methods for replacing each missing value of an attribute with k > 1 different imputed ones, thus creating k simulated and complete datasets which reflect the uncertainty of missing data [9]. More specifically, for each missing value, the estimated values are stored in a 1 × k row vector, while the corresponding components of the vectors along with the observed data constitute the k simulated datasets [3]. Imputation of missing data may be carried out by applying a specific technique, such as a regression model, or even a sequence of regression models, such as linear, logistic and Poisson, as has been shown in [31]. The k simulated datasets are analyzed separately, and the results are finally combined. Multiple imputation is a statistical technique which was originally proposed by Rubin for handling the problem of non-response in sample surveys [3]. The idea behind the creation of k multi-imputed datasets is to reflect both the variation inside a single imputed dataset and the sensitivity of inferences from the k different imputed datasets [32], contrary to the single imputation approach. Unfortunately, the multiple imputation approach is more complex and time demanding, and requires large data storage capabilities [9]. An important issue in multiple imputation is the appropriate selection of k, which is usually set equal to 3 or 5 [33].
To recapitulate, a vast array of different methodologies has been put forward for efficiently handling the missing values problem, as can readily be seen in the pertinent literature. These methods may be sorted into different types [15]: simple and straightforwardly applicable approaches such as deletion methods, statistical methods, machine learning based methods and hybrid methods such as the one in [34], which combines C4.5, a well-known decision tree classifier, with the expectation maximization method. Even though hybrid methods have emerged only recently in the imputation field, their effectiveness has been amply demonstrated in a wide range of studies. At this point, we should point out that there is no universal imputation method that performs best for all datasets [35]. The use of datasets with different structure, the difference lying in the number of instances and attributes, as well as the variation in the percentage of missing data, makes it difficult to identify a widely approved method.
Motivated by the recent trend concerning machine-learning based imputation methods, we propose a predictive model aiming to estimate missing values in incomplete datasets utilizing the SSL approach.

III. THE PROPOSED METHOD
As stated above, the main objective of this study is to present an imputation algorithm incorporating the SSL approach. The proposed algorithm is established on the basis of the Iterative Robust Model-based Imputation (IRMI) [21] algorithm. The IRMI algorithm is an improved variant of the Sequential Regression Multivariate Imputation (SRMI) approach proposed in [31], an effective and quite robust imputation technique for complex data structures, especially when data are MAR or MCAR. To harness the potential of SSL, two self-training techniques are employed within the attribute fitting loop of the IRMI algorithm, thus constructing a new hybrid imputation method based on robust semi-supervised ensembles, which we call Iterative Robust Semi-Supervised based Imputation (IRSSI).
Simulating the IRMI algorithm behavior, IRSSI is an iterative algorithm which loops through all available attributes of a dataset, setting each time one of them as response attribute and all the others as independent ones. Essentially, the response attribute in each iteration is the dataset feature which is going to be imputed by the algorithm. The proposed IRSSI algorithm is introduced below in 7 basic steps.
Step 1: The missing values of a specific attribute are initialized by replacing them with the mean or the mode value of the observed ones, whereas, at the same time, the original positions of missing values are recorded.
Step 2: The attributes are sorted in ascending order according to the total number of missing values in each one. For simplicity, we consider the following notation for the sorted attributes:
$$M(x_1) \le M(x_2) \le \dots \le M(x_p)$$
where $M(x_i)$ denotes the number of missing values for the attribute $x_i$. In addition, let $I = \{1, \dots, p\}$ denote the set of all attribute indices.
Step 3: A pointer l = 1 is initialized and used as an attribute index.
Step 4: Form the matrices $X^{o_l}_{I \setminus \{l\}}$ and $X^{m_l}_{I \setminus \{l\}}$ related to attribute $x_l$, where $o_l$ and $m_l$ index the instances in which $x_l$ is observed and missing, respectively. The $X^{o_l}_{I \setminus \{l\}}$ matrix together with the response attribute $x^{o_l}_l$ constitutes the labeled set $L^0_l$ of observed attributes for the current $l$. Using the same notation, the $X^{m_l}_{I \setminus \{l\}}$ matrix along with $x^{m_l}_l$ constitutes the current unlabeled set $U^0_l$ for $l$. According to the type of attribute $x_l$, one of two procedures follows:
I. If $x_l$ is a numerical attribute, the sets $L^0_l$ and $U^0_l$ are passed to a multi-scheme semi-supervised regression (SSR) procedure [36]. This scheme utilizes three regression algorithms (hereinafter referred to as regressors) in order to efficiently augment $L^0_l$ with the $U^0_l$ instances [37].
II. If $x_l$ is a categorical attribute, the sets $L^0_l$ and $U^0_l$ are passed to a self-trained classification procedure, an improved variant of [38], to augment $L^0_l$ with the $U^0_l$ instances.
The multi-scheme SSR procedure and the self-trained classification are briefly described below.
Step 5: According to the type of $x_l$, the response $x^{m_l}_l$ is computed utilizing the internal base learners defined in procedures I or II, trained on the augmented labeled set ($L_l$) produced in Step 4. In the case of a numerical $x_l$, the averaged predictions of the three regressors are used as response values.
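Step 7: Repeat steps 3 to 6 until the imputed cells are steady, according to the criterion:
$$\sum_{i} \left(\hat{x}^{m_l}_{l,i} - \tilde{x}^{m_l}_{l,i}\right)^2 < \varepsilon, \quad \text{for all } i \in m_l \text{ and } l \in I$$
where $\hat{x}^{m_l}_{l,i}$ is the i-th imputed value of the current iteration and $\tilde{x}^{m_l}_{l,i}$ the previous imputed value. The constant $\varepsilon$ is a small convergence parameter.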
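The overall control flow of the steps above can be sketched as follows. This is only an illustration of the outer loop, not the IRSSI implementation: it handles numerical attributes only, assumes every attribute has at least one observed value, encodes missing cells as `None`, and replaces the semi-supervised procedures of Steps 4 and 5 with a pluggable placeholder `fit_predict`. The names `iterative_impute` and `one_nn`, and the summed squared-change stopping rule, are assumptions made for this sketch.

```python
def one_nn(X_obs, y_obs, X_mis):
    """Placeholder base learner: copy the response of the nearest labeled row."""
    preds = []
    for q in X_mis:
        best = min(
            range(len(X_obs)),
            key=lambda t: sum((a - b) ** 2 for a, b in zip(X_obs[t], q)),
        )
        preds.append(y_obs[best])
    return preds


def iterative_impute(rows, fit_predict=one_nn, eps=1e-6, max_iter=20):
    n, p = len(rows), len(rows[0])
    # Step 1: record the missing positions and initialise them with the mean.
    missing = [[i for i in range(n) if rows[i][j] is None] for j in range(p)]
    data = [list(row) for row in rows]
    for j in range(p):
        observed = [rows[i][j] for i in range(n) if rows[i][j] is not None]
        col_mean = sum(observed) / len(observed)
        for i in missing[j]:
            data[i][j] = col_mean
    # Step 2: visit attributes in ascending order of their missing counts.
    order = sorted((j for j in range(p) if missing[j]), key=lambda j: len(missing[j]))
    for _ in range(max_iter):
        change = 0.0
        for l in order:
            others = [j for j in range(p) if j != l]
            miss = set(missing[l])
            # Labeled part: rows where attribute l was originally observed.
            X_obs = [[data[i][j] for j in others] for i in range(n) if i not in miss]
            y_obs = [data[i][l] for i in range(n) if i not in miss]
            # Unlabeled part: rows where attribute l was originally missing.
            X_mis = [[data[i][j] for j in others] for i in missing[l]]
            preds = fit_predict(X_obs, y_obs, X_mis)
            for i, new in zip(missing[l], preds):
                change += (new - data[i][l]) ** 2
                data[i][l] = new
        # Stop once the imputed cells no longer move between sweeps.
        if change < eps:
            break
    return data


toy = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [3.0, None]]
completed = iterative_impute(toy)
```

In the proposed method, the call to `fit_predict` would correspond to augmenting the labeled set through self-training and then predicting the missing responses with the base learners.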
The regression and classification procedures outlined in Step 4 both utilize the semi-supervised method of self-training [39]. The following paragraphs briefly discuss their logic flow.

A. MULTI-SCHEME SSR PROCEDURE
Starting with the first procedure, a multi-scheme SSR algorithm is employed using $L^0_l$ and $U^0_l$ as input. The applied algorithm is a variant of [37] and is based on an ensemble of three base regressors (bRegS) which are combined in a self-training loop using an instance selection function (MRL) [37]. The employed base regressors are described below in brief.
• Random Forests (RFs) is a simple, powerful and robust ensemble method for both classification and regression problems [40]. RFs creates multiple decision trees, using different randomly selected subsamples of features for splitting each tree node, and aggregates their results via majority voting (or averaging, in the case of regression). A main advantage of the RFs algorithm is that it can efficiently handle overfitting.
• M5 is a model tree algorithm proposed by Quinlan [42], which induces trees of multivariate linear regression models. M5 is very effective and can successfully handle missing values and high-dimensional datasets [43]. In brief, the M5 learner grows regression trees whose leaves are themselves linear regression models.
In general, the combination of multiple regression models has a positive impact on the reduction of the generalization error. The selected models are widely referenced in the literature and are both efficient and robust. In each loop iteration, bRegS is trained on the current labeled set $L_{iter}$ (with $L_0$ being equal to $L^0_l$). The trained models are then applied on $U_{iter}$ (with $U_0$ being equal to $U^0_l$) and a matrix containing the predicted values is generated. Subsequently, the matrix is sorted using MRL and a percentage ($T$) of the unlabeled observations is added to $L_{iter}$, and removed from $U_{iter}$, using as target values the average predictions of bRegS for each observation. After the termination of the loop, the augmented labeled set $L_l$ ($\equiv L_{iter=last}$) is constructed.
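The self-training loop described above can be sketched as follows. This is a hedged illustration rather than the procedure of [37]: the base regressors are simple stand-ins (a mean predictor and two k-NN regressors) instead of the ones used in the paper, and the MRL selection function is approximated, as an assumption, by preferring the unlabeled observations on which the regressors disagree least. All function names are hypothetical.

```python
import math


def mean_regressor(train_X, train_y, query_X):
    """Stand-in regressor: always predict the mean of the labeled responses."""
    avg = sum(train_y) / len(train_y)
    return [avg for _ in query_X]


def make_knn_regressor(k):
    """Stand-in regressor: average the responses of the k nearest labeled rows."""
    def predict(train_X, train_y, query_X):
        preds = []
        for q in query_X:
            order = sorted(
                range(len(train_X)),
                key=lambda t: sum((a - b) ** 2 for a, b in zip(train_X[t], q)),
            )
            nearest = order[:k]
            preds.append(sum(train_y[t] for t in nearest) / len(nearest))
        return preds
    return predict


def self_train_regression(L_X, L_y, U_X, regressors, T=0.2, max_iter=5):
    """Augment the labeled set with pseudo-labeled observations from U_X."""
    L_X, L_y, U_X = list(L_X), list(L_y), list(U_X)
    for _ in range(max_iter):
        if not U_X:
            break
        # Train every regressor on the current labeled set and predict U.
        all_preds = [reg(L_X, L_y, U_X) for reg in regressors]
        per_instance = list(zip(*all_preds))
        # Assumed selection rule: smallest spread between the regressors first.
        spread = [max(p) - min(p) for p in per_instance]
        n_pick = max(1, math.ceil(T * len(U_X)))
        picked = sorted(range(len(U_X)), key=lambda i: spread[i])[:n_pick]
        for i in picked:
            L_X.append(U_X[i])
            # The target is the average prediction of the ensemble.
            L_y.append(sum(per_instance[i]) / len(per_instance[i]))
        picked_set = set(picked)
        U_X = [x for i, x in enumerate(U_X) if i not in picked_set]
    return L_X, L_y


ensemble = [mean_regressor, make_knn_regressor(1), make_knn_regressor(3)]
aug_X, aug_y = self_train_regression(
    [[0.0], [1.0], [2.0]], [0.0, 1.0, 2.0], [[0.1], [1.9], [1.0]], ensemble, T=0.5
)
```

The returned pair plays the role of the augmented labeled set on which the base regressors are finally trained.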

B. SELF-TRAINED CLASSIFICATION PROCEDURE
The second procedure is used to exploit categorical attributes and is based on the algorithm presented in [38]. The base learner used inside the self-training classification loop is the RFs algorithm, which was picked for consistency reasons. In the same manner, in each self-training iteration, RFs is trained on $L_{iter}$ and applied on $U_{iter}$ to produce predictions. The most confident predictions are collected in a matrix ($M_{MCP}$) considering the prediction probabilities of the observations of $U_{iter}$. Only a percentage ($T$) of the most confident predictions is added to $L_{iter}$ (and removed from $U_{iter}$). After the self-training exit criteria are met, the augmented labeled set is contained in $L_l$.
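The classification counterpart can be sketched in the same spirit. This is a minimal illustration, not the algorithm of [38]: a k-NN vote stands in for the RFs base learner, the vote share is used as the prediction probability, and the stopping rule (unlabeled set exhausted or `max_iter` reached) is an assumption. The names `knn_proba` and `self_train_classification` are hypothetical.

```python
import math
from collections import Counter


def knn_proba(train_X, train_y, query, k=3):
    """Stand-in base learner: return the majority label among the k nearest
    labeled rows together with its vote share."""
    order = sorted(
        range(len(train_X)),
        key=lambda t: sum((a - b) ** 2 for a, b in zip(train_X[t], query)),
    )
    nearest = order[:k]
    votes = Counter(train_y[t] for t in nearest)
    label, count = votes.most_common(1)[0]
    return label, count / len(nearest)


def self_train_classification(L_X, L_y, U_X, T=0.5, max_iter=10):
    """Augment the labeled set with the most confident pseudo-labels."""
    L_X, L_y, U_X = list(L_X), list(L_y), list(U_X)
    for _ in range(max_iter):
        if not U_X:
            break
        preds = [knn_proba(L_X, L_y, x) for x in U_X]
        # Keep only a fraction T of the predictions, most confident first.
        n_pick = max(1, math.ceil(T * len(U_X)))
        picked = sorted(range(len(U_X)), key=lambda i: -preds[i][1])[:n_pick]
        for i in picked:
            L_X.append(U_X[i])
            L_y.append(preds[i][0])
        picked_set = set(picked)
        U_X = [x for i, x in enumerate(U_X) if i not in picked_set]
    return L_X, L_y


aug_X, aug_y = self_train_classification(
    [[0.0], [0.1], [1.0], [1.1]], ["a", "a", "b", "b"], [[0.05], [1.05]]
)
```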
The pseudocode of IRSSI imputation algorithm is presented in Alg. 1, which summarizes the employed techniques.

IV. EXPERIMENTAL PROCESS AND RESULTS
Two basic approaches were used to validate the efficacy of the proposed algorithm. The first one is based on experimentation with real-world benchmark datasets. In the second, artificial datasets were constructed in order to further explore different aspects of IRSSI performance.

A. EXPERIMENTATION ON BENCHMARK DATASETS
The experiments were based on fifteen benchmark datasets from a variety of domain problems and were extracted from the UCI [44] repository, while a brief description of their structure is presented in Table 1. We considered datasets with different structure: datasets with mixed type of attributes (categorical and numerical) and datasets consisting only of categorical attributes or numerical ones. Moreover, we considered both binary and multiclass classification problems.
The experimental process consisted of four consecutive steps as illustrated in Fig. 3. Initially, each complete dataset was partitioned into ten equally sized folds using the 10-fold cross-validation resampling technique, thus ensuring the same distribution in each fold as in the full dataset. Subsequently, each fold was injected randomly with missing values employing the MCAR missingness mechanism. Three different proportions of missing values were considered in each dataset, hereinafter called missing ratio (MR), and in particular: 30%, 40% and 60%. The choice of missing ratio was based on relevant studies. These studies are primarily focused on small missing ratios, usually from 5% to 30%, while there is a lack of prior studies that consider missing ratios greater than 50% [15]. The next step was the simulation of missing values. This process was carried out employing six common and state-of-the-art imputation methods which can handle both categorical and numerical attributes, and in particular:
• The Mean/Mode method, which is regarded as one of the most representative baseline statistical missing values imputation techniques [15].
• The Fuzzy k-means (FKMeans) Clustering imputation method with the following values of input parameters: k = 3, m = 1.5 and the Euclidean metric as distance measure [45].
• The IRMI machine learning based imputation algorithm.
• The proposed IRSSI algorithm.
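The MCAR injection step of the experimental process can be sketched as follows. This is an illustration under stated assumptions: every cell of the given attribute matrix is equally likely to be removed, the class attribute is not part of `rows`, and missing cells are encoded as `None`. The function name `inject_mcar` and the fixed seed are hypothetical.

```python
import random


def inject_mcar(rows, missing_ratio, seed=0):
    """Return a copy of rows in which a fraction missing_ratio of the cells,
    chosen uniformly at random, is replaced by None."""
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    cells = [(i, j) for i in range(n) for j in range(p)]
    n_missing = round(missing_ratio * len(cells))
    masked = [list(row) for row in rows]
    # Sampling without replacement keeps the realised ratio exact.
    for i, j in rng.sample(cells, n_missing):
        masked[i][j] = None
    return masked


complete = [[float(i + j) for j in range(5)] for i in range(10)]
masked = inject_mcar(complete, 0.3)
n_removed = sum(value is None for row in masked for value in row)
```

Because the removed cells are selected independently of any attribute value, the resulting missingness does not depend on the observed or the missing data.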
In addition, for assessing the performance of the imputation methods employed in the experiments, two popular classification algorithms, belonging to representative machine learning families, were finally trained on the simulated and fully completed datasets. Hence, the classification process relies on the assumption that the imputed datasets simulate the real ones [12]. The two classification algorithms deployed after the imputation process were the following:
• Rotation Forest (RF), a powerful ensemble of independent decision trees, based on feature extraction. Each base classifier is trained on a randomly selected subset of features, while Principal Component Analysis is applied to each one of the subsets [47].
• JRip, an implementation of RIPPER (Repeated Incremental Pruning to Produce Error Reduction), a very effective and interpretable rule-based induction learning algorithm based on incremental reduced error pruning [48].
After the imputation process, each classifier was trained on nine simulated folds forming the training set, while the remaining one, complete but non-simulated, was used for testing the performance of the classifier. This process was repeated ten times, until all folds were used as test set, and the results were averaged [49]. We then computed the overall accuracy of each learning model, a commonly used metric for classification problems, which corresponds to the percentage of correctly classified instances. In fact, accuracy is considered to be one of the most important metrics for evaluating different imputation methods for classification problems [20]. According to [20], the best imputation method is the one that gives the best accuracy results for a specific classifier and a predefined missing ratio. The complete experimental procedure of our study is illustrated in Fig. 4.
A total of 45 incomplete and different datasets were finally included in the experimental process (3 missing ratios for each one of the 15 datasets). The average accuracy results regarding the three different missing ratios considered are summarized in Tables III, IV and V, while the best values for each dataset are highlighted in bold (including ties). Moreover, the standard deviation results in each case are presented in the same tables below each dataset accuracy. For simplicity, we make use of the notations $acc_p$ and $std_p$ for the accuracy and the standard deviation measure respectively for a missing ratio of p%.
It is clearly shown that the IRSSI algorithm performs better than all other imputation methods for most datasets, regardless of the missing ratio and the classifier deployed after the imputation process. The total number of wins for each imputation method, according to the missing ratio and the classifier deployed after the imputation process, is shown in Table 2, while the best scores are highlighted in bold.
In more detail:
• Depending upon the missing ratio, IRSSI is found to prevail in all three scenarios. More precisely, IRSSI scores the highest accuracy values in 11, 14 and 16 datasets using a missing ratio of 30%, 40% and 60% respectively, followed by LLSimpute (7 and 6 datasets) and Fuzzy k-means (6 datasets).
• Employing the RF classifier on the simulated datasets to evaluate the performance of the imputation methods, it is observed that IRSSI performs better in 22 datasets, followed by LLSimpute (9 datasets) and FKMeans (5 datasets). Similar results were obtained in the case of JRip. The IRSSI algorithm obtains better results in 19 datasets, followed by IRMI (9 datasets) and LLSimpute (8 datasets).
• Considering the influence of missing ratio, we can see that the following inequality holds:
$$acc_{30} > acc_{40} > acc_{60}$$
as could logically be expected. Besides that, it is worth noting that the advantage of IRSSI over the other methods grows as the missing ratio increases (11, 14 and 16 wins for 30%, 40% and 60% respectively), thereby confirming the potential of SSL.
• As regards the estimated deviations, we observe small values in most datasets for all missing ratios, meaning that the average accuracies are close enough to the real ones.
In addition, an extensive statistical analysis of the results was carried out to confirm the performance of IRSSI. To this end, we applied the Friedman non-parametric test [50] followed by the Li post hoc test [51] (significance level α = 0.05). Both statistical tests are commonly used for comparing the performance of more than two methods [52]. According to the Friedman test results (Tables VI-VIII), the six imputation methods were sorted from the best performer (lowest ranking value) to the worst one (highest ranking value) for each of the six scenarios (three different missing ratios and two different classifiers). It is clearly shown in these tables that IRSSI prevails in all scenarios, while the remaining methods consistently rank behind it.
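The ranking step underlying the Friedman test can be illustrated with a short sketch: within each dataset the methods are ranked by accuracy (rank 1 for the highest accuracy, tied methods sharing the average of their positions), and the ranks are then averaged over the datasets. The function name `average_ranks` is hypothetical and the accuracy values below are made up for illustration; they are not results from this study.

```python
def average_ranks(scores):
    """scores[d][m] is the accuracy of method m on dataset d.
    Returns the mean rank of each method (lower is better)."""
    n_methods = len(scores[0])
    totals = [0.0] * n_methods
    for row in scores:
        # Methods ordered from highest to lowest accuracy on this dataset.
        order = sorted(range(n_methods), key=lambda m: -row[m])
        pos = 0
        while pos < n_methods:
            end = pos
            # Extend the block while the next method is tied with this one.
            while end + 1 < n_methods and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2 + 1  # average of the 1-based positions
            for m in order[pos:end + 1]:
                totals[m] += shared
            pos = end + 1
    return [total / len(scores) for total in totals]


# Made-up accuracies of three hypothetical methods on two datasets.
toy_scores = [[0.9, 0.8, 0.7], [0.9, 0.9, 0.5]]
ranks = average_ranks(toy_scores)
```

The method with the lowest average rank is the best performer in the sense used by the ranking tables.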
Since the null hypothesis $H_0$ (i.e. that the means of the results of the six methods are the same) was rejected, the Li post hoc test was used for detecting the specific differences among them. Li's test is very powerful and produces better results than other tests, especially when testing multiple hypotheses. The post hoc results are also displayed in Tables VI-VIII using IRSSI as control method. It is worth noting that, among the remaining methods, none seems to prevail. To be more specific, LLSimpute and IRMI perform relatively well for the first two scenarios (i.e. for 30% missing ratio), although they lag behind IRSSI. For the next two scenarios (i.e. for 40% missing ratio), no method appears able to compete with IRSSI. Finally, in the last two scenarios (i.e. for 60% missing ratio), FKMeans also achieves good results.
In addition, the performance of IRSSI is higher than that of its main rival (i.e. IRMI), as demonstrated by the experimental results and the statistical tests. It is therefore evident that IRMI benefits from the integration of the self-training process employed within the attribute fitting loop, thereby yielding a more robust and accurate imputation method, especially in datasets with a large proportion of missing values. So, it becomes clear that IRSSI can efficiently handle the deficiency phenomenon in incomplete datasets of different structure and missing ratio values.

B. EXPERIMENTATION ON ARTIFICIAL DATA
In addition, a series of experiments was conducted on artificially constructed data in order to reveal the efficiency of IRSSI in comparison with its main rival (i.e. IRMI) and against LLSimpute. To this end, five randomly generated artificial datasets were used. For each dataset, a random five-class classification problem was constructed. The feature values of the datasets were drawn from clusters of points normally distributed about vertices of an n-dimensional hypercube, where n is the number of informative features. A set of general parameters was selected with a view to generating robust random datasets. Table 9 summarizes them.
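The generation scheme described above can be sketched as follows. This is a simplified illustration, not the generator used in the study: the function name, its parameters, the unit standard deviation, the hypercube side length and the round-robin class assignment are all assumptions made for the example.

```python
import itertools
import random
from typing import List, Tuple


def make_hypercube_clusters(n_samples: int, n_informative: int, n_noise: int,
                            n_classes: int, class_sep: float = 1.0,
                            seed: int = 0) -> Tuple[List[List[float]], List[int]]:
    """Draw points normally distributed about vertices of a hypercube.

    Hypothetical helper: one vertex per class, unit standard deviation.
    """
    if n_classes > 2 ** n_informative:
        raise ValueError("not enough hypercube vertices for the requested classes")
    rng = random.Random(seed)
    # Vertices of a hypercube with side length 2 * class_sep, centred at the origin.
    vertices = list(itertools.product((-class_sep, class_sep), repeat=n_informative))
    centers = rng.sample(vertices, n_classes)
    features, labels = [], []
    for i in range(n_samples):
        label = i % n_classes  # balanced classes (assumption)
        informative = [rng.gauss(c, 1.0) for c in centers[label]]
        noise = [rng.gauss(0.0, 1.0) for _ in range(n_noise)]  # pure-noise features
        features.append(informative + noise)
        labels.append(label)
    return features, labels
```

For instance, `make_hypercube_clusters(100, 5, 2, 5)` would produce a five-class dataset with five informative and two noise features, which mirrors the layout of the artificial datasets without reproducing their actual parameters.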
Each artificial dataset is composed of seven features, five of which are informative, while the remaining two contain random noise. At the time of generation, all features were numerical, thus a discretization process was applied in five [...]
In order to compare the performance of the rivals, the average root mean square error (Mean RMSE) of the feature vector differences between each original and imputed dataset instance was calculated for each imputation algorithm according to the following formula:
$$\text{Mean RMSE} = \frac{1}{n}\sum_{i=1}^{n}\sqrt{\frac{1}{k}\sum_{j=1}^{k}\left(v_{i,j}^{\text{orig}} - v_{i,j}^{\text{imp}}\right)^{2}} \tag{3}$$
where n is the number of instances, k is the number of features and v_{i,j} is the corresponding feature value (feature j of instance i) in the original (orig) and the imputed (imp) dataset. All categorical features were transformed to their one-hot encoding [53] equivalents, so that the above equation could be readily applied. Since five artificial datasets were generated, every error presented in the figures of this section is the error averaged over the five artificial datasets.
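As an illustration of this error measure, the sketch below one-hot encodes the categorical columns and then averages the per-instance root mean square error over all instances. It reflects one reading of the description (per-instance RMSE, averaged over instances, with k counted after encoding); the function names and the `levels` mapping are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Union

Value = Union[float, str]


def encode(row: Sequence[Value], levels: Dict[int, List[str]]) -> List[float]:
    """One-hot encode the categorical columns listed in `levels`.

    `levels` maps a column index to the ordered list of its category values;
    all other columns are treated as numeric and passed through.
    """
    encoded: List[float] = []
    for j, value in enumerate(row):
        if j in levels:
            encoded.extend(1.0 if value == level else 0.0 for level in levels[j])
        else:
            encoded.append(float(value))
    return encoded


def mean_rmse(original: List[Sequence[Value]], imputed: List[Sequence[Value]],
              levels: Dict[int, List[str]]) -> float:
    """Average over instances of the RMSE between original and imputed vectors."""
    total = 0.0
    for orig_row, imp_row in zip(original, imputed):
        a, b = encode(orig_row, levels), encode(imp_row, levels)
        total += math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))
    return total / len(original)
```

A perfectly imputed dataset yields an error of zero, and a wrongly imputed category contributes two unit differences after encoding (one for the true level, one for the imputed level).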
In the first experiment, we compare the behavior of the algorithms over different missing value ratios. In more detail, the five original artificial datasets were injected with missing values in ratios varying from 20% to 60% and the resulting datasets were fitted using the three imputation algorithms. The computed mean errors for each ratio are presented in Fig. 5. The three algorithms are close in terms of generated errors for very low missing ratios, whereas IRSSI is steadily producing lower imputation errors as the missing ratio increases.
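The injection step can be sketched as follows. The text does not state which missingness mechanism was used for the artificial data, so the example assumes that cells are removed completely at random (MCAR), uniformly over all cells; the function name and the use of `None` as the missing marker are illustrative choices.

```python
import random
from typing import List, Optional, Sequence


def inject_missing(rows: Sequence[Sequence[float]], ratio: float,
                   seed: int = 0) -> List[List[Optional[float]]]:
    """Return a copy of `rows` with a fraction `ratio` of the cells set to None."""
    rng = random.Random(seed)
    cells = [(i, j) for i, row in enumerate(rows) for j in range(len(row))]
    n_missing = round(ratio * len(cells))
    # Choose the masked cells uniformly at random, without replacement.
    masked = set(rng.sample(cells, n_missing))
    return [[None if (i, j) in masked else value for j, value in enumerate(row)]
            for i, row in enumerate(rows)]
```

Repeating the call for each ratio between 20% and 60% and imputing the resulting copies would give one error value per ratio and algorithm, which is the kind of curve reported in Fig. 5.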
In the second set of experiments, we compare the performance of the three algorithms in the presence of outliers. To this end, the original artificial datasets were injected with outliers in five different ratios ranging from 2% to 10% and subsequently with missing values (30%, 40% and 60%); the results are shown in Figs. 6-8. There is a clear predominance of IRSSI, which supports the view that ensemble schemes tend to handle outlier values better [54].
Moreover, in order to observe the imputation capability of IRSSI and IRMI, a sixth artificial dataset was generated containing only three informative features (two numeric and one categorical), three classes and one hundred samples. This dataset was injected with 50% missing values and was imputed using the two algorithms. The original dataset clusters along with the imputation-generated clusters can be observed in Fig. 9. To compare the quality (compactness and separation) of the generated clusters, two indices were computed on the two imputed datasets. The first index is the Dunn index (DI) [55], an internal cluster validation scheme. Higher index values indicate better clustering. It is calculated as follows:
$$DI = \frac{\min_{1 \le i < j \le c} \delta(X_i, X_j)}{\max_{1 \le k \le c} \Delta(X_k)} \tag{4}$$
where c is the number of clusters, $\delta(X_i, X_j)$ is the inter-cluster distance between clusters $X_i$ and $X_j$, and $\Delta(X_k)$ is the intra-cluster distance of cluster $X_k$. The second one is the Davies-Bouldin index (DBI) [56], formulated in (5). The clustering quality is judged using quantities and features inherent to the dataset. Lower DBI values indicate better separation and tightness of the clusters.
$$DBI = \frac{1}{c}\sum_{i=1}^{c}\max_{j \ne i}\frac{\Delta(X_i) + \Delta(X_j)}{\delta(X_i, X_j)} \tag{5}$$
where $\delta(X_i, X_j)$ and $\Delta(X_i)$, $\Delta(X_j)$, as above, denote the inter-cluster and intra-cluster distances respectively. Table 10 summarizes the computed indices, which reveal a slightly better clustering behavior for the IRSSI algorithm.
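Both indices can be computed with a few lines of code once the inter-cluster and intra-cluster distances are fixed. The text does not specify these distances, so the sketch below assumes the Euclidean distance between centroids for the inter-cluster distance and the mean distance of the members to their centroid for the intra-cluster distance; other common choices (e.g. cluster diameter) would give different values. All function names are illustrative.

```python
import math
from typing import List, Sequence

Point = Sequence[float]
Cluster = List[Point]


def centroid(points: Cluster) -> List[float]:
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]


def inter_cluster(a: Cluster, b: Cluster) -> float:
    """Assumed inter-cluster distance: Euclidean distance between centroids."""
    return math.dist(centroid(a), centroid(b))


def intra_cluster(points: Cluster) -> float:
    """Assumed intra-cluster distance: mean distance of members to the centroid."""
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)


def dunn_index(clusters: List[Cluster]) -> float:
    """Smallest inter-cluster distance over largest intra-cluster distance."""
    n = len(clusters)
    min_inter = min(inter_cluster(clusters[i], clusters[j])
                    for i in range(n) for j in range(i + 1, n))
    max_intra = max(intra_cluster(c) for c in clusters)
    return min_inter / max_intra  # undefined if every cluster is a single point


def davies_bouldin_index(clusters: List[Cluster]) -> float:
    """Mean over clusters of the worst (largest) similarity to another cluster."""
    n = len(clusters)
    total = 0.0
    for i in range(n):
        total += max((intra_cluster(clusters[i]) + intra_cluster(clusters[j]))
                     / inter_cluster(clusters[i], clusters[j])
                     for j in range(n) if j != i)
    return total / n
```

Under these assumptions, compact and well-separated clusters give a large Dunn index and a small Davies-Bouldin index, which matches the interpretation of the two indices given in the text.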
Finally, a meta-dataset was constructed with a view to extracting meaningful rules regarding the performance of IRSSI in connection with the structure of the datasets. To this end, Tables I, III, IV and V were combined to produce the meta-dataset. The derived binary class feature indicates whether IRSSI outperforms the other compared methods in each case. The rules were automatically extracted using decision tree and RIPPER algorithms. In summary, two general rules were obtained: (1) If the dataset consists mainly of nominal features, then IRSSI displays strong performance. (2) The performance of IRSSI increases significantly as the missing ratio increases.

V. CONCLUSIONS
In the present study, we proposed a new hybrid imputation method based on robust semi-supervised ensembles. The Iterative Robust Semi-Supervised based Imputation algorithm (IRSSI) is a refined version of the IRMI algorithm, harnessing the benefits of SSL in a simple and efficient manner. The experimental results on fifteen benchmark datasets, using different and high ratios of missing data (30%, 40% and 60%) and two typical classifiers after the imputation process (RF and JRip), favor IRSSI over familiar imputation methods: the Mean/Mode statistical method, the Fuzzy k-means single imputation method, LLSimpute, SVDimpute and IRMI as the baseline method. Furthermore, the behavior of the rivals was examined on artificially generated datasets, considering a variety of missing value ratios and the presence of extreme outliers. The comparison between IRSSI, IRMI and LLSimpute verifies the superiority of the proposed method, as statistically confirmed by the Friedman non-parametric test and the Li post hoc test.
It is worth considering a few ideas to further improve the proposed algorithm. The first one concerns the utilization of the parallel execution capabilities of modern processing units. Several design changes in the flow of the algorithm (e.g. employing a more sophisticated flow for the calculation of multiple dependent attribute responses at once) would enable IRSSI to impute the dataset's attributes in a more parallelized manner and increase its throughput. Another step in this direction is the modification of the inner procedures of IRSSI (step 4) to embrace prediction models that are suitable for execution on GPUs. Such advancements could make the algorithm suitable for big data analysis or data streaming problems in combination with deep learning methods.
In addition, investigating the performance of IRSSI on other machine learning problems seems an interesting area for future research. For example, it would be worth examining the efficacy of the algorithm when imputation is applied together with clustering algorithms such as density-based spatial clustering (DBSCAN) [57] or balanced iterative reducing and clustering (BIRCH) [58]. Furthermore, applying IRSSI as an imputation method to enhance regression datasets could also increase the data correlation in regression or even in time series-based problems.
Finally, embedding filters for handling outliers and extreme values in the imputed data would be expected to have an immediate positive effect on the accuracy of IRSSI. Filtering algorithms, such as the local outlier factor [59], which detects anomalous values based on neighboring data, or Isolation Forest [60], a tree-based outlier detector, could be a good fit for application within the proposed algorithm.

APPENDIX
A full implementation of the IRSSI algorithm was developed in Java for the WEKA [61] software tool, which offers a user-friendly graphical interface and supports a plethora of classification, regression and clustering algorithms. Our implementation is publicly available as a separate package at https://github.com/fazakis/semisupervised-missing-values-imputation-weka-package, while the algorithm is located under the filters menu of WEKA.