Performance Analysis of Classification Algorithms on Birth Dataset

Generating insights from data using data mining and machine learning algorithms to predict outcomes is a useful area of computing. The applications of data mining techniques and machine learning range widely, including industry, healthcare, organizations, and academia. Continuous improvement is witnessed due to ongoing research, particularly in healthcare. Several researchers have applied machine learning to develop decision support systems, analyze dominant clinical factors, extract useful information from hidden patterns in historical data, make predictions, and classify diseases. Successful research has created opportunities for physicians to make appropriate decisions at the right time. In the current study, we utilize the learning capability of machine learning methods for the classification of birth data using bagging and boosting classification algorithms. Differences in lifestyle, medical assistance, religious practices, and region of residence collectively affect the residents of a society. This has encouraged researchers to conduct studies at the regional level to comprehensively explore the medical factors that contribute to complications among women during pregnancy. The current study is a comprehensive comparison of bagging and boosting classification algorithms performed on birth data collected from the government hospitals of the city of Muzaffarabad, Kashmir. The experimental tasks are carried out using the caret package in R, which is considered an inclusive framework for building machine learning models. Accuracy-based results with different evaluation measures are presented. Bagging functions, including AdaBag and BagFda, performed marginally better in terms of accuracy, precision, and recall. Improvements are observed in comparison to a previous study performed on the same dataset.


I. INTRODUCTION
Knowledge engineering and machine learning help to simulate decision-making activities in various fields such as healthcare, manufacturing, image processing, prediction, and customer services. Numerous machine learning algorithms are used for prediction and data classification. Healthcare organizations are using applications of information technology and machine learning in order to optimize operational decisions and respond to the needs of physicians. Machine learning helps in emergency medical situations, decision-making activities, and general primary care, and also helps physicians to choose the best operation methods when it is hard to predict the outcomes in diagnosing patients [1]. Much research is focused on the development of decision support systems that assist physicians to gain insights and predict different outcomes relevant to the area of interest [35]. However, some subtle areas that are major causes of death in low-income countries demand attention, for example pregnancy complications. These complications may expose the expecting mother to diseases [38], [40] and cause situations demanding birth by caesarean section (C-section). A C-section is a method in which the child is delivered surgically rather than vaginally. A C-section becomes inevitable in the case of multiple babies in the uterus, a baby in a transverse or breech position, previous C-section deliveries, a substantially large infant, and so forth. Physicians recommend a C-section if vaginal delivery may jeopardize the life of the expecting mother or child [2]. A very high rate of C-sections is recorded worldwide; for example, 23 million C-sections were recorded in the year 2012 alone across the world [3].
Beyond this recorded figure, the number of deliveries conducted in homes, private clinics, etc., is significantly high. Medically, it is believed that a C-section affects the health of the mother and the caesarean child. Therefore, it is important to conduct research that may help to discover the physical or related factors that contribute to situations demanding a C-section. It is perceived that pregnant women sharing a similar social life pass through several similar pregnancy experiences and medical situations during their gestational period. These experiences and medical condition factors of expecting women become handy in prediction-based studies if collected and documented carefully. This could be helpful in two ways. Firstly, such data may help to generate predictions about the current subject under observation. Secondly, the predicted results may help physicians to take immediate measures by dealing with the issues that contribute to a situation demanding a C-section. At this point, machine learning algorithms with data mining approaches come into play, which are good at learning the hidden patterns in data and producing valued information [4]. Machine learning algorithms find hidden patterns in a dataset by creating relationships between dependent and independent variables. This relationship can be used to understand what affects the outcome variable and to predict the outcomes of unobserved data. This process is called supervised learning for classification, in which the data is already labeled with the response variable so that hidden patterns can be established. Several supervised machine learning algorithms have been developed to serve this purpose of classification. These classifiers are mostly used in the same way: train on a dataset, generate predictions using a test dataset, and then evaluate results statistically in terms of mean accuracy. Keeping this in mind, the current study contributes in different ways.
Firstly, regional disparities may make it inappropriate to apply the outcomes of a study conducted for one region to people belonging to some other region [36], [37]. Therefore, we have collected regional birth data so that the findings may be applicable to the women of the targeted region. Secondly, we have performed birth data classification using bagging and boosting machine learning algorithms to present a cause analysis and suggest the best classification model to predict birth outcome in terms of mean accuracy for our birth data. Thirdly, we offer a few suggestions in the conclusion section that may help policy-making institutions to revise their policies regarding pregnant women and their expected children. Lastly, this study is the first of its kind in the region that encouraged computing professionals and physicians to work together and establish grounds for future research that will uplift the society.

II. RELATED WORK
The application of machine learning algorithms has gained popularity in different fields of computing [42]-[47]. The range of scholars motivated to investigate prognosis and diagnosis using machine learning methods is progressively increasing. This section reviews a few studies relevant to the current one. The authors in [5] studied the influential factors associated with neonatal and perinatal mortalities. They identified that the neonatal mortality rate was 31.40 per 1000 live births and the perinatal mortality rate was 49.70 per 1000 pregnancies. Infection (43%) was found to be the main cause of neonatal deaths. Risk factors for death were C-section delivery (P = 0.049), preterm delivery (P = 0.003), and pregnancy with twins (P = 0.001). In [6], the authors collected birth data from government hospitals of their city. They performed bi-variate analysis to identify the relationship between the maternal factors affecting the mode of birth. They produced different classification models using random forests [7], support vector machines [8], linear discriminant analysis [9], neural networks [10], etc. They reported random forest (RF) to be the best classifier for their birth data in terms of mean accuracy and other evaluation criteria. Another such classification-based study including several machine learning classifiers was conducted by Robu and Holban [11]. A total of 2325 birth records were acquired from the Bega Obstetrics and Gynecology Clinique. The study investigated the relationship between the level of glucose in the blood from the umbilical cord, the newborn's cry, the mother's body mass index before pregnancy, and the Apgar score. Several classification models were built from the dataset values. A dedicated Weka API based application was also developed to classify birth outcome using the LogitBoost algorithm.
Another study aimed to investigate the influence of feature selection on the performance of a Naive Bayes (NB) classifier for fetal heart rate patterns and fetal states. The authors used four different feature selection methods: correlation-based, ReliefF, Information Gain, and Mutual Information. ReliefF performed better for fetal state classification; however, no significant effect of the feature selection methods was reported on the classification of fetal heart rate [12]. Fergus et al. [13] used a dataset of 552 intrapartum recordings and applied signal processing techniques to cardiotocography fetal heart rate traces to extract 13 features. They found satisfactory results with an ensemble classifier including RF, SVM, and Fisher's linear discriminant analysis. Warrick et al. [14] modeled the fetal heart rate baseline with a linear fit and the fetal heart rate variability unrelated to uterine pressure using the power spectral density computed from an auto-regressive model. They trained an SVM with the feature set of this model and a perinatal database including normal and pathological cases. Their method detected pathological cases with only a 7.5% false positive rate. Dulitzki et al. [15] applied multiple logistic regression to study the association between C-section and maternal age. The findings indicated that expecting women of 44 years of age or above have higher rates of medical complications that make a C-section birth more probable. On the other hand, the rate of C-section among women between 22-29 years is lower than among expecting women of older ages. Another such study was conducted by Goldman et al. [16], where the relationship between C-section deliveries and maternal age was evaluated using statistical analysis. The idea was to group the expecting women into different categories based on their ages. Three groups were devised: expecting women below 35 years, between 35-39 years, and above 40 years.
Statistical analysis revealed that women above 40 years were susceptible to placental abruption, placenta previa, and C-section delivery. The results further testified that women between 35-39 years were prone to miscarriages and chromosomal abnormalities. Furthermore, the authors in [17] investigated the relationship between age, fetus weight, and C-section. They applied the T-test, the Chi-square test, and multiple logistic regression to investigate this relationship. They identified that a C-section is more probable among expecting women over 35 years of age and when the weight of the expected child is 3600 grams or above. In [18], the authors applied several validation techniques, statistical analyses, and machine learning methods to analyze three datasets with 214 variables and 18,890 subjects. In order to assess preterm birth risks, they developed an expert system and claimed it to be more accurate than existing manual methodologies for preterm birth assessment. Another study used the C4.5 algorithm to predict preterm birth risks during pregnancy [39]. The authors applied the C4.5 algorithm to standardized and un-standardized pregnancy data. The accuracy reported for standardized and unstandardized data is 71.30% and 66.08%, respectively. Electro-hysterography (uterine electrical signals) was monitored to predict preterm births by the authors in [19]. This classification-based study was conducted on a dataset including 38 preterm and 262 term records. The authors observed improvements over the existing studies of the time. In [20], the authors used RF in combination with a model based on latent class analysis to define the classes. The study concluded that the proposed modeling methodology is promising and probably appropriate for developing decision support systems. Another study [21] identified that lower birth weight is associated with untreated coeliac disease in parents.
They used multiple linear regression for the adjusted analyses of differences in birth week, birth weight, and birth length. Binary logistic regression was used for the calculation of adjusted odds ratios (AOR) of the dichotomous variables preterm birth, low birth weight, C-section, and neonatal hospital care. A multivariate approach was followed in [22] to understand the association between socio-economic factors and their influence on expecting women. The authors claimed that emotional imbalance is associated with social support and life stress, which cause complications during the gestational period and at the time of delivery. The most recent research in the region is found in [41], a classification study using several machine learning algorithms to predict cleft before birth. The data was collected from different hospitals of Lahore city, covering 1000 expecting women. A multilayer perceptron yielded 92.6% accuracy on the test data.
The related work presented above demonstrates the interest of researchers in discovering solutions to pregnancy complications. However, most of the work has been carried out in the advanced countries of the world. In low-income countries, much effort is still required to ensure the safety of the expecting mother and her child.

III. METHODOLOGY
The idea behind the machine learning process is to find patterns in the training part of the data in such a way that they generalize well to the test part of the data. In order to find patterns, an attempt is made to create a relationship between predictors and a response variable. Once this relationship is established, one may focus on explanatory modeling with the intent to understand the relationship, or the relationship may lead to predictive modeling, where one tries to predict the response variable from the predictor variables on unobserved data. This process represents the supervised learning scheme for classification, which is concerned with establishing patterns in data given a predefined response variable and a set of predictor variables. Several supervised learning algorithms, or classifiers, have been developed to serve this purpose. Though these classifiers are most of the time used in the same way, that is, trained and tested on specific datasets and then evaluated statistically and/or on the basis of the accuracy they yield, they may be categorized into different families according to the method they adopt to perform classification. A few well-known methods for performing classification include decision trees, classifiers based on statistical learning, perceptron-based classification, support vectors, and ensemble learning.
Decision trees may be depicted as a logical structure of branches and nodes, where a node signifies a feature of the associated instance and a branch denotes a value that a node can receive. The classification process involves sorting the instances based on feature values [23]. Statistical learners define a probability model to compute the probability that an instance belongs to a class. Some statistical learners, i.e., LDA and Fisher's linear discriminant [24], attempt to separate classes by discovering a linear combination of features [25]. LDA works well with continuous quantities, while for categorical data, discriminant correspondence analysis is supposed to perform better. NB is another powerful probabilistic classifier based on the assumption of independence among predictor variables [26]. NB calculates the posterior probability for each class, and the class with the highest probability becomes the outcome. Another strong classifier under the umbrella of statistical learning is K-Nearest Neighbor (K-NN), which assumes that instances in close proximity have similar properties, so the label of an unclassified instance is deducible from the labels of its neighbors [27]. K-NN uses distance metrics, for example Euclidean, Manhattan, Chebyshev, or Minkowski, under which instances of the same class should lie close together and instances of dissimilar classes should lie far apart.
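The K-NN scheme just described fits in a few lines. The following is a minimal Python sketch (the study itself uses the caret package in R) with hypothetical toy data: Euclidean distance to every training instance, then a majority vote among the k closest labels.

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest
    training instances under Euclidean distance."""
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated 2-D clusters.
X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
y = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(X, y, (0.3, 0.3)))  # query near cluster A
```

Swapping `math.dist` for a Manhattan or Chebyshev distance function changes the metric without touching the voting logic.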
Contrary to K-NN, SVM attempts to reduce the generalization error by maximizing the margin, i.e., the distance between the instances on either side of the hyperplane that separates them [8]. In the case of linearly separable data, once the optimal hyperplane is found, the data points on the margin become the support vectors. A linear combination of these support vectors is then used to produce results. In the case of non-separable data, it is hard to find an optimum hyperplane to separate the instances into classes. In such cases, the hyperplane is found for data that is mapped into a high-dimensional space known as the transformed feature space. When dealing with linearly separable data, a single-layer perceptron is considered adequate for classification. Multi-layer perceptrons, also known as artificial neural networks (ANN), come into action when dealing with non-separable data. An ANN is composed of an input unit, hidden units for processing, and an output unit for holding results. The classification process in an ANN includes the activation of signals at the input, hidden, and output units. At the beginning, the input unit is activated and, depending upon some activation criterion, fires its output to the hidden units to which it is connected. Hidden units fire their signals to the output unit based on some activation value. The activation function at the output unit sums up the outputs of the sending units, which is further adjusted to, for example, a value between 0 and 1. However, the performance of any of the above methods depends upon several factors, for example, the number and type of feature instances, the size and dimensionality of the data, and storage and processing requirements. For example, decision trees and related methods perform well with categorical or discrete data; on the other hand, ANN and SVM perform well with continuous features and handle multicollinearity and non-linearity well. As far as training time is concerned, decision trees and NB train quickly as compared with SVM and ANN.
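The single-layer perceptron for linearly separable data can be illustrated with the classic error-driven weight update: nudge the weights toward a misclassified example until every instance lands on the correct side of the boundary. This is a hedged toy sketch (data and learning rate are hypothetical, not from the study):

```python
def train_perceptron(X, y, epochs=20, lr=0.1):
    """Single-layer perceptron; y holds labels -1 or +1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            pred = 1 if activation >= 0 else -1
            if pred != target:  # misclassified: move the boundary toward x
                w = [wi + lr * target * xi for wi, xi in zip(w, x)]
                b += lr * target
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# AND-like pattern: linearly separable, so the update rule converges.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [-1, -1, -1, 1]
w, b = train_perceptron(X, y)
print([predict(w, b, x) for x in X])  # matches y after convergence
```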
From a storage point of view, NB requires little storage space during classification or training; however, K-NN demands large storage space during training. The performance behavior of different classifiers, as evidenced by the literature [28], is graphically depicted in Figure 1. It is obvious that the accuracy-based performance of the discussed algorithms varies depending on several factors. Every classifier has strengths and weaknesses. At this point, the concept of ensemble learning becomes relevant, which suggests combining classifiers for better performance than any individual classifier. Two famous ensembles are bagging and boosting, which try to minimize the classification error by dealing with the bias and variance causing errors. Bagging, or more formally bootstrap aggregating, attempts to obtain a generalized result by combining the results of various models through bootstrapping, i.e., making subsets of the provided observations by sampling from the dataset with replacement [7]. The size of these subsets (bags) is kept the same as that of the original dataset. For each bag, the same base model is trained, and the individual predictions from the different models are combined to achieve the overall result. Boosting [29]-[31] is a sequential process of prediction in which every previous model provides the basis of predictions for the next model. The next model in the series is responsible for correcting the errors, or fulfilling the deficiencies, of the prior model in order to classify the data items correctly. If some input is classified incorrectly, its weight is increased so that in the next iteration this item may be classified correctly. Every next model is thus dependent on the previous one. In this way, weak learners are combined to produce a strong model.
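The bagging procedure described above — bootstrap bags of the original size, one base model per bag, aggregation by vote — can be sketched as follows. The one-feature "stump" base learner and the data are hypothetical simplifications for illustration, not the caret implementations used in the study:

```python
import random
from collections import Counter

def fit_stump(sample):
    """Very weak hypothetical base learner: threshold one feature at the
    midpoint between the two class means."""
    xs0 = [x for x, y in sample if y == 0]
    xs1 = [x for x, y in sample if y == 1]
    if not xs0 or not xs1:          # degenerate single-class bag: skip it
        return None
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2

def bagged_predict(data, x, n_bags=25, seed=1):
    """Bagging: one bootstrap bag per base model, aggregated by vote."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_bags):
        bag = [rng.choice(data) for _ in data]   # same size, with replacement
        thr = fit_stump(bag)
        if thr is not None:
            votes[1 if x >= thr else 0] += 1
    return votes.most_common(1)[0][0]

# Hypothetical one-feature data: label 0 below, label 1 above.
data = [(0.5, 0), (1.0, 0), (1.5, 0), (4.0, 1), (4.5, 1), (5.0, 1)]
print(bagged_predict(data, 4.2), bagged_predict(data, 0.8))
```

Each bag yields a slightly different threshold; averaging the ensemble's votes is exactly the variance-reduction effect bagging is designed for.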
The current study incorporates the bagging and boosting functions for classification available in the caret package, which is considered an inclusive framework for building machine learning models in R. The methods incorporated include different variants of bagging and boosting. For example, Bagged Flexible Discriminant Analysis (bagFDA), TreeBag, and AdaBag are selected from the bagging family; these are variants of bagging, and the different tuning parameters in each classifier affect the final outcome. The methods selected from the boosting family include AdaBoost.M1, XGBLinear, XGBDart, and XGBTree. The data used in the experiment was collected from government hospitals of Muzaffarabad, the capital city of Kashmir. Questionnaires were filled in the presence of an OB-GYN. 79 different historical, maternal, social, and physical factors covering 98 subjects were acquired. The study attempts to analyze the effect of maternal health factors on the mode of birth. The cause analysis of social factors and birth outcome is not in the scope of the current study. Keeping this in mind, 488 instances with 24 factors (maternal history, during pregnancy) and 1 response variable were selected to perform classification using bagging and boosting machine learning algorithms. The factors used in the experiments are presented in Table 1. A few important features of the data used are provided below.
1) Out of 488 cases with first successful live births, 296 women delivered via C-section and 192 delivered naturally.
2) 312 women delivered between 20 and 27 years of age.
3) A higher rate of C-section is reported among women in the early age group (17-23 years) and in their late 30's.
4) The frequency of miscarriage is higher among women in their late 30's.
5) The frequency of abortions is higher among women between 18 and 25 years of age.
6) Women who had gone through surgeries, or who tended to have high blood pressure, lower levels of hemoglobin, diabetes, or hypertension, delivered via C-section.
7) The number of newborn baby girls is nearly double that of newborn baby boys.
The data was divided into conventional 75% training and 25% test samples. Cross validation [32] is used to mitigate overfitting of the models. The bagging and boosting based models are trained using the training data. These models are then used to predict the class label, which is later compared to the actual class label in the test data.
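In the study, the 75%/25% split and the cross-validation folds are handled internally by caret. Conceptually, the index bookkeeping looks like the sketch below (the round-robin fold scheme and fold count are assumptions for illustration only):

```python
import random

def split_and_folds(n, test_frac=0.25, k=5, seed=42):
    """Shuffle instance indices, hold out a test set, and partition the
    remaining training indices into k cross-validation folds."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = int(n * test_frac)
    test, train = idx[:n_test], idx[n_test:]
    folds = [train[i::k] for i in range(k)]   # round-robin fold assignment
    return train, test, folds

train, test, folds = split_and_folds(488)     # 488 instances, as in the study
print(len(train), len(test))                  # 366 training, 122 test indices
```

During cross-validation, each fold in turn serves as the validation set while the remaining folds are used for fitting, which is what mitigates overfitting when tuning the models.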
The training and testing accuracies of the bagging and boosting based models are presented in Table 2 and depicted in Figure 2. These results demonstrate that BagFda and AdaBag have higher training accuracies as compared to the rest. The higher testing accuracy of BagFda shows its superiority in predicting the class label. This superiority is not accidental. As discussed earlier, bagging is a combination of bootstrapping and aggregation that aims at reducing the variance by averaging together several estimates.
In our case, bootstrap sampling provided the subsets of data to train the base learners, and voting aggregated the outputs of the base learners for classification. On the other hand, AdaBoost begins training by assigning equal weights to all instances. In subsequent rounds, boosting puts more weight on the misclassified instances and less on those that are already handled well. This scheme of classification is better than any linear model in which the predictions are a function of linearly scaled parameters. These strengths of bagging and boosting increased the mean training and testing accuracies in comparison to the previous study [6]. The results are further supported with evaluation standards including the Kappa statistic, confusion matrices, F-measures, and ROC curves. The Kappa statistic [33], which measures the difference between the agreement actually observed and the agreement expected by chance alone, is provided for each classifier in Table 3 and is useful for evaluating agreement beyond chance. The literature suggests that a kappa value above 0.81, or 81%, represents almost perfect agreement. Table 3 shows that every classifier falls into this range. BagFda and AdaBag have marginally higher mean Kappa measures as compared to the rest. The summary of results based on mean accuracy from Tables 2 and 3 is shown in Figure 3.
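Cohen's kappa can be computed directly from a 2x2 confusion matrix. As a hedged illustration (not a reproduction of Table 3), the sketch below applies the standard formula to the AdaBag test-set counts reported in Figure 4(C) — 63 true positives, 5 false positives, 4 false negatives, 50 true negatives:

```python
def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa from a 2x2 confusion matrix: observed agreement
    corrected for the agreement expected by chance alone."""
    n = tp + fp + fn + tn
    po = (tp + tn) / n                          # observed agreement
    # chance agreement: product of marginal rates, summed over both classes
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    return (po - pe) / (1 - pe)

# AdaBag test-set counts from Figure 4(C).
print(round(cohen_kappa(63, 5, 4, 50), 3))      # ~0.851
```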
Another useful representation for evaluating the performance of a classification model with regard to its correct prediction rate is the confusion matrix. A confusion matrix uses the predicted class labels for test data whose actual class labels are known and provides the true positive, false positive, true negative, and false negative counts for the classifier. The confusion matrices for all classifiers incorporated in the current study are provided in Figure 4 (A to G). For example, the confusion matrix of AdaBag in Figure 4(C) depicts that, out of 122 instances of the test dataset, AdaBag correctly classified 63 instances from class A (column A, C-section) and misclassified 4 instances. Furthermore, it correctly classified 50 instances from class B (column B, normal delivery) and misclassified 5 instances.
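Tallying a confusion matrix from predicted and actual labels is straightforward; the following minimal sketch uses short hypothetical label lists rather than the study's test set:

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels=("A", "B")):
    """Tally (actual, predicted) pairs into matrix[actual][predicted]."""
    counts = Counter(zip(actual, predicted))
    return {a: {p: counts[(a, p)] for p in labels} for a in labels}

# Hypothetical labels: A = C-section, B = normal delivery.
actual    = ["A", "A", "A", "B", "B", "B", "B", "A"]
predicted = ["A", "B", "A", "B", "B", "A", "B", "A"]
print(confusion_matrix(actual, predicted))
```

The diagonal entries are the correct classifications; the off-diagonal entries are the misclassifications discussed for each classifier in Figure 4.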
With confusion matrices comprising true positives and false positives in hand, we move further towards producing ROC curves, which represent the trade-off between the sensitivity and specificity of a classifier [34], [35]. A classifier with high accuracy has ROC curves near the upper left corner. The sensitivity, also known as recall, of a classifier is the fraction of relevant instances that are actually retrieved, i.e., TP/(TP + FN). Precision is the fraction of retrieved instances that are relevant, i.e., TP/(TP + FP), while specificity is the fraction of negative instances correctly identified as negative, i.e., TN/(TN + FP). A classifier with a low false positive rate acquires higher precision, which consequently reflects its higher accuracy. On the other hand, higher recall is achieved when the classifier returns most of the positive results. The precision and recall measures for all classifiers are provided in Table 4. The trade-off between sensitivity and specificity is represented using ROC curves, as discussed above. The ROC curves for the classifiers, with AUC values, are provided in Figure 5 (A to G). Finally, we compare the results of the current study with a previous study in the same domain. The top 2 classifiers in terms of highest mean accuracy are selected from the previous article (RF and SVM) and the current one (AdaBag and BagFda). The highlighted values in Table 5 represent the highest value produced by a classifier in the column head category. The results clearly show improvements in terms of training accuracy, testing accuracy, and Kappa statistics.
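As a worked example of these rates, the AdaBag counts from Figure 4(C) can be plugged into the definitions above, treating class A (C-section) as the positive class. This is an illustrative calculation, not a reproduction of Table 4:

```python
def rates(tp, fp, fn, tn):
    """Sensitivity, specificity, and precision from 2x2 counts."""
    sensitivity = tp / (tp + fn)   # recall: actual positives retrieved
    specificity = tn / (tn + fp)   # actual negatives correctly rejected
    precision = tp / (tp + fp)     # retrieved instances that are relevant
    return sensitivity, specificity, precision

# AdaBag test-set counts: 63 TP, 5 FP, 4 FN, 50 TN.
sens, spec, prec = rates(63, 5, 4, 50)
print(round(sens, 3), round(spec, 3), round(prec, 3))  # 0.94 0.909 0.926
```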

IV. CONCLUSION
Regional disparities have motivated the authors to conduct a cause analysis and classification-based study on a locally collected birth dataset. This study is the first of its kind in the region that incorporates bagging and boosting classification functions applied using the caret package, which is considered an inclusive framework for building machine learning models in R. Accuracy-based results with several evaluation measures are presented for a comprehensive comparison. Bagging functions, including AdaBag and BagFda, performed marginally better in terms of accuracy, precision, and recall. Improvements are observed in the results as compared to a previous classification-based study performed on the same dataset. BagFda is an appropriate classification model for our birth data, with the highest mean training and testing accuracies, i.e., 94.53% and 93.44%, respectively. It is believed that once the best classification model is identified, it will enable the development of reliable decision support systems, creating opportunities for physicians to take decisions based on the information these systems provide. Moreover, several maternal and health related factors are associated with complications during pregnancy that consequently demand a C-section. For example, early age marriages put women at higher risk of delivering surgically. Physicians must address the causes that create complications during pregnancy, for example high blood pressure, lower levels of hemoglobin, diabetes, and hypertension. Mitigating these complications may help avoid situations demanding a C-section. Furthermore, the government must address the future needs of a society with a probably increased number of women as compared to men. Above all, the continuity of such interdisciplinary studies in the region is essential, as they will help healthcare organizations to take effective measures towards the safety of the expecting mother and fetus.

FIAZ MAJEED received the Ph.D. degree in computer sciences from the University of Engineering and Technology, Lahore, Pakistan, in 2016. He is currently serving as the Head of the Software Engineering Department, under the Faculty of Computing and Information Technology, University of Gujrat (UOG), Pakistan. His research interests include data warehousing, data streams, information retrieval, and social networks. He has published more than 20 papers in refereed journals and international conference proceedings in the above areas. He is a reviewer of several renowned international journals.