An ensemble based approach using a combination of clustering and classification algorithms to enhance customer churn prediction in telecom industry

Mobile communication has become a dominant medium of communication over the past two decades. New technologies and competitors are emerging rapidly and churn prediction has become a great concern for telecom companies. A customer churn prediction model can provide the accurate identification of potential churners so that a retention solution may be provided to them. The proposed churn prediction model is a hybrid model that is based on a combination of clustering and classification algorithms using an ensemble. First, different clustering algorithms (i.e. K-means, K-medoids, X-means and random clustering) were evaluated individually on two churn prediction datasets. Then hybrid models were introduced by combining the clusters with seven different classification algorithms individually and then evaluations were performed using ensembles. The proposed research was evaluated on two different benchmark telecom data sets obtained from GitHub and Bigml platforms. The analysis of results indicated that the proposed model attained the highest prediction accuracy of 94.7% on the GitHub dataset and 92.43% on the Bigml dataset. State of the art comparison was also performed using the proposed model. The proposed model performed significantly better than state of the art churn prediction models.


INTRODUCTION
Data mining is the process of identifying patterns and extracting knowledge from large amounts of data (Han, Pei & Kamber, 2011). It allows future trends to be forecast by employing different prediction models, and enterprises can predict future patterns using various data mining tools and techniques. The purpose of data mining is to uncover meaningful information hidden in large datasets and to apply that information to useful tasks (Rustam et al., 2019).
The telecom sector has become one of the most important industries in developed countries over the past two decades. Data mining plays a vital role in prediction and analysis in the telecom industry because of the large volume of data available. Its basic application is the prediction of churners in order to retain customers and maintain a high profit rate. Data mining techniques are used in the telecom sector to observe the churn behaviour of customers.
As the number of telecom users increases, companies now offer a variety of services to retain customers. A customer who switches service providers in search of better services and benefits is said to churn. When a customer leaves a service provider, the company loses revenue. Prediction can be performed to identify potential churners so that retention solutions can be offered to them. A large number of mining algorithms are available that classify customers into churners and non-churners.
The telecom sector is growing rapidly due to different technologies. Different companies provide different quality of service for communication; some give better services than others. In order to stop churn, companies offer different services that are attractive to their customers. Data mining technologies are used to perform churn prediction using different algorithms such as Naïve Bayes, decision tree, neural network and logistic regression. An accurate prediction model is very helpful for the correct identification of customer churn and plays a vital role in making decisions about retention (Vijaya & Sivasankar, 2018). The best customer churn prediction model can identify churners and guide decision-makers towards maximum profit (Höppner et al., 2017; Ali et al., 2018; Amin et al., 2019).
Customers churn for a number of reasons. The most important of them are call or package rates that do not suit the customer (Tiwari, Sam & Shaikh, 2017; Petkovski et al., 2016). Whether a customer intends to leave can be identified on the basis of the customer's historical data and behaviour.
Existing studies show that an efficient churn prediction model should make effective use of a large volume of historical data in order to identify churners. However, existing models have a number of limitations that prevent churn prediction from being performed efficiently and with high accuracy. The telecom sector generates a large volume of data that contains missing values, and prediction on such data yields poor or inaccurate outputs for the prediction models in the literature. Data preprocessing is now performed to resolve this issue, and missing values are imputed using machine learning methods, which results in higher classification and prediction accuracy. Feature selection is also performed in the literature; however, some important and information-rich features are neglected during model development. Moreover, statistical methods are used for model generation, which results in poor prediction performance. Furthermore, benchmark datasets are not used for model evaluation in the literature, so the results do not represent the true picture of the data, and a fair comparison between different models cannot be made without benchmark datasets. An intelligent model can be used to resolve these issues and to predict churn more accurately.
The proposed churn prediction model is based on a combination of clustering and classification algorithms. The performance of the proposed model is evaluated on different churn prediction datasets using metrics such as accuracy, precision, recall and f-measure. The objectives of the proposed research are to identify the issues in the literature, to provide an efficient model for customer churn prediction, and to identify churners with high accuracy. Retention strategies may then be provided to the potential churners. The experiments show that the proposed churn prediction model performs better by achieving high accuracy.
The main contributions of this research are as follows: a churn prediction model with high accuracy is proposed; data preprocessing is performed for missing value imputation, noise removal and duplicate removal; important features are selected using a feature selection technique; clustering and classification techniques are combined to perform customer churn prediction on two large telecom datasets; and customer profiling is performed using a clustering technique to divide customer behaviour into different groups such as low, medium and risky.
The remainder of the paper is organized as follows: "Literature Review" reviews the literature. "Proposed Methodology" presents the proposed churn prediction model. Experimental evaluation and results are presented in "Experimental Results, Evaluation and Discussion". Finally, "Conclusion and Future Work" presents the conclusion and future work.

A churn prediction model named JIT-CCP has been proposed in the literature. In this model the first step is data pre-processing and the second step is binary classification, followed by performance evaluation using the confusion matrix values: true positives, true negatives, false positives and false negatives. Based on these terms, the probability of detection (PD) is calculated and used to assess the accuracy of multiple classifiers. The closer the PD value is to 1, the better the classifier's results, and vice versa. However, this model is not suitable for a large amount of data.

Vijaya & Sivasankar (2018) argued that customer retention plays a valuable role in the success of a firm. It not only increases the company's profit but also maintains the company's ranking in the telecom industry. Retaining customers is less costly than acquiring new ones. Customer maintenance and customer association management (CAM) are therefore the two parameters for the success of every company. In this research a hybrid model of supervised and unsupervised techniques is used for churn prediction. The model has several stages. In the first stage, the data is cleaned and deviations are removed. In the next stage, training and testing sets are obtained from different clusters. After this, prediction algorithms are applied. In the final stage, accuracy, specificity and sensitivity are measured to evaluate the efficiency of the model.

Höppner et al. (2017) stated that customer retention policies rely on different predictive models. The most recent development is the expected maximum profit measure for customer churn (EMPC), which selects the most profitable churn model. In this research a new classification method is introduced that integrates the EMPC metric directly into the churn model. This technique is known as ProfTree. The main advantage of this model is that a telecom company can gain maximum profit. The model has increased performance and accuracy compared to other models. In future this model may be combined with different algorithms to further increase prediction performance.

Ali et al. (2018) used different mining algorithms and techniques for the prediction of churners. WEKA software is used to apply different classifiers. The first step of this model is data preprocessing, where missing values are removed. After preprocessing, Feature Subset Selection (FSS) is performed to reduce the number of features, which also reduces the cost of securing the data. After that, the information gain ratio is used for ranking on the target dataset. The advantage of this research is that it identifies interesting patterns for predicting churners' behaviour. The disadvantage is that as the dataset grows, the process becomes slow and requires more time for prediction.

LITERATURE REVIEW
Bharat (2019) proposed a model based on the activity pattern of customers. Specifically, customer activity is measured by the average length of inactive periods and the frequency of inactivity. The proposed method can be used in other domains for churn prediction.

Gajowniczek, Orłowski & Ząbkowski (2019) used an Artificial Neural Network with entropy cost functions as a prediction model for customers. Numerical methods such as classification trees or SVM provide higher classification accuracy, and the study shows a simple way to apply the new q-error functions to the problem.

Zhang et al. (2018) stated that predicting customer churn is valuable for telecom companies that want to retain important users. A customer churn prediction (CCP) model with higher accuracy is very important for customer retention decisions. In this paper the SVM technique is used because of its better precision. It handles samples that are linearly inseparable in a low-dimensional (e.g., two-dimensional) space. A limitation of the proposed model is that churned customers are very difficult to quantify; therefore, a more thorough investigation is needed.

Ahmed et al. (2020) proposed a model based on a combination of different classifiers in order to create a hybrid ensemble model for prediction. In this paper bagged stack learners are proposed. Experimentation is performed on two datasets related to telecom companies, and high accuracy is obtained. The limitation of this model is that it does not work on generalized datasets.

Brownlow et al. (2018) introduced a new methodology for churn prediction in fund management services. This framework is based on ensemble learning, and a new weighting mechanism is proposed to deal with the imbalanced cost-sensitivity problem in financial data. The model uses heterogeneous data collected from different companies. The performance of this model may be increased through feature extraction and enhancement of the learning methods.

Vo et al. (2018) used text mining and data mining methods for the prediction of churn. Multiple methods, such as semantic information and word importance, are used for churn prediction. This model uses only unstructured data for prediction. In future this research can be extended to segmentation and to building a personalized recommendation system for different financial services and products.

Calzada-Infante, Óskarsdóttir & Baesens (2020) compared two techniques, Time-Order-Graph and Aggregated-Static-Graph, with a similarity forest classifier, using three threshold measures to evaluate the predictive performance of the classifier with each centrality metric.

Although a lot of work has been done on churn prediction (e.g., Nguyen & Duong, 2021), there is still room for improvement. There is a need for a churn prediction model with high prediction accuracy. The proposed model is based on a combination of clustering and classification techniques and attains high prediction accuracy. The summary of the literature is shown in Table 1.

PROPOSED METHODOLOGY
The proposed model increases churn prediction performance by using a hybrid model in which combinations of different clustering and classification ensembles are introduced. Figure 1 shows the different modules of the proposed model.

Data acquisition
The first task of the proposed model is data acquisition. Two benchmark churn prediction datasets are used for model evaluation. These datasets are acquired from online data repositories. The first churn prediction dataset is collected from GitHub, a freely available online data repository for research. This dataset contains 5,000 instances, with 707 churn customers and 4,293 non-churn customers. The second dataset is collected from the Bigml platform, which is also a freely available online repository; it contains 3,333 instances with 21 attributes, of which 483 are churn and 2,850 are non-churn customers. This dataset contains customers' behavioural, demographic and revenue information.
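The class counts reported above imply that both datasets are imbalanced, which is worth keeping in mind when reading accuracy figures. The following minimal sketch works through that arithmetic; the dictionary layout and function names are assumptions made for illustration and are not part of the original study.

```python
# Class counts as reported in the text for the two datasets.
datasets = {
    "GitHub": {"churn": 707, "non_churn": 4293},
    "BigML": {"churn": 483, "non_churn": 2850},
}

def churn_rate(counts):
    """Fraction of customers labelled as churners."""
    total = counts["churn"] + counts["non_churn"]
    return counts["churn"] / total

def majority_baseline_accuracy(counts):
    """Accuracy of a trivial model that always predicts the larger class."""
    total = counts["churn"] + counts["non_churn"]
    return max(counts["churn"], counts["non_churn"]) / total

for name, counts in datasets.items():
    print(name,
          round(churn_rate(counts), 4),
          round(majority_baseline_accuracy(counts), 4))
```

Under these counts roughly 14% of customers in each dataset are churners, so a model that always predicts "non-churn" would already reach about 86% accuracy. This is one reason precision, recall and f-measure are reported alongside accuracy.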

Data preprocessing
The main purpose of data pre-processing is to remove noise, anomalies, missing values and duplication from data (Azeem, Usman & Fong, 2017). In pre-processing, a model needs to remove missing values, noisy data and duplication, and to use only the important features of the data (Omar et al., 2021). Data preprocessing is the first step applied to the churn prediction data. The proposed churn prediction model incorporates the following tasks during data pre-processing.
Data Cleaning: Prediction is very difficult when there are missing values, duplication and noise in the data. Data cleaning is therefore performed to replace missing values with the mean of the corresponding attribute, to remove duplicated data, and to identify and remove noisy or erroneous values.
Feature Selection: Feature selection is the most important step of data pre-processing. Feature selection is performed using forward selection, and the most important features are chosen for the prediction model.
Data Reduction: In this step the data is reduced to a smaller volume in order to produce compact and understandable results.
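The data cleaning step can be illustrated with a short sketch of mean imputation followed by duplicate removal. This is an illustration only, written in plain Python rather than with the RapidMiner operators used in the study; the function names, the use of `None` to mark a missing value and the toy rows are assumptions. Forward feature selection is not shown.

```python
def impute_mean(rows):
    """Replace None entries in each column with that column's mean."""
    n_cols = len(rows[0])
    means = []
    for c in range(n_cols):
        present = [r[c] for r in rows if r[c] is not None]
        means.append(sum(present) / len(present))
    return [[means[c] if r[c] is None else r[c] for c in range(n_cols)]
            for r in rows]

def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving order."""
    seen, unique = set(), []
    for r in rows:
        key = tuple(r)
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

# Hypothetical numeric attributes; None marks a missing value.
raw = [[10.0, 2.0], [None, 4.0], [10.0, 2.0], [30.0, None]]
clean = drop_duplicates(impute_mean(raw))
print(clean)
```

In this toy input the missing value in the first column is replaced by the mean of 10, 10 and 30, and the repeated row is dropped, leaving three rows.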

Clustering algorithms
After data preprocessing, clustering is applied to the cleaned and refined data. The proposed model employs clustering in order to improve prediction performance. The following clustering methods are used by the proposed model.

K-means clustering
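The K-means clustering algorithm divides N rows into K segments, where K is always less than N. It randomly selects K initial points, each of which represents the centre (mean) of a cluster. It then measures the distance between each data point and the cluster centres, assigns each point to its nearest centre and recomputes the mean of every cluster. The process continues iteratively until the clusters no longer change. The following objective function is used to measure the distance (Gajowniczek, Orłowski & Ząbkowski, 2019).

$$J(V) = \sum_{j=1}^{K} \sum_{i=1}^{n_j} \left( \lVert x_i - v_j \rVert \right)^2$$

where $\lVert x_i - v_j \rVert$ is the Euclidean distance between data point $x_i$ and cluster centre $v_j$, K is the number of clusters and $n_j$ is the number of data points in cluster j.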
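The iterative procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the RapidMiner operator used in the experiments: the initial centres are passed in explicitly rather than drawn at random so that the run is reproducible, and the function names and toy points are assumptions made for this example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def assign(points, centres):
    """Index of the nearest centre for every point."""
    return [min(range(len(centres)), key=lambda j: euclidean(p, centres[j]))
            for p in points]

def update(points, labels, centres):
    """Recompute each centre as the mean of its members.
    An empty cluster keeps its previous centre."""
    new = []
    for j, old in enumerate(centres):
        members = [p for p, l in zip(points, labels) if l == j]
        if members:
            new.append(tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            new.append(old)
    return new

def objective(points, labels, centres):
    """Sum of squared distances of points to their assigned centres."""
    return sum(euclidean(p, centres[l]) ** 2 for p, l in zip(points, labels))

def kmeans(points, centres, max_iter=100):
    """Alternate assignment and update until the centres stop moving."""
    for _ in range(max_iter):
        labels = assign(points, centres)
        new_centres = update(points, labels, centres)
        if new_centres == centres:
            break
        centres = new_centres
    return labels, centres

points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
labels, centres = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(labels, centres, objective(points, labels, centres))
```

On these four points the procedure converges after one update, grouping the two points near the origin together and the two points near (8, 8) together.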

K-medoids clustering
In 1987, Kaufman and Rousseeuw introduced a partition-based clustering technique termed the K-Medoids algorithm. K-Medoids is more robust to noise and outliers than k-means (Gajowniczek, Orłowski & Ząbkowski, 2019). The following formula is used to calculate the cost of each cluster.

$$c = \sum_{C_i} \sum_{P_i \in C_i} \lvert P_i - C_i \rvert$$

where $P_i$ and $C_i$ are the objects for which the dissimilarity is calculated: $P_i$ is a data object and $C_i$ is the medoid of the cluster to which it belongs.
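To make the cluster cost concrete, the sketch below computes the total dissimilarity of a set of medoids and performs a single improvement pass that tries every medoid and non-medoid swap. It assumes Manhattan distance as the dissimilarity measure and uses hypothetical toy points; it is a simplified illustration of the partitioning idea, not the exact procedure or operator used in the study.

```python
def manhattan(a, b):
    """Sum of absolute coordinate differences between two objects."""
    return sum(abs(p - q) for p, q in zip(a, b))

def total_cost(points, medoids, dissimilarity=manhattan):
    """Sum over all points of the dissimilarity to the closest medoid."""
    return sum(min(dissimilarity(p, m) for m in medoids) for p in points)

def best_single_swap(points, medoids, dissimilarity=manhattan):
    """Try every (medoid, non-medoid) swap against the starting medoids
    and return the cheapest configuration found."""
    best, best_cost = list(medoids), total_cost(points, medoids, dissimilarity)
    for i in range(len(medoids)):
        for candidate in points:
            if candidate in medoids:
                continue
            trial = list(medoids)
            trial[i] = candidate
            cost = total_cost(points, trial, dissimilarity)
            if cost < best_cost:
                best, best_cost = trial, cost
    return best, best_cost

points = [(1, 1), (2, 1), (3, 1), (8, 8), (9, 8), (10, 8)]
start = [(1, 1), (8, 8)]
medoids, cost = best_single_swap(points, start)
print(total_cost(points, start), medoids, cost)
```

Because a medoid must be an actual data object, an outlier can shift the cost but cannot drag the cluster representative to an empty region of the space, which is the intuition behind the robustness claim above.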

X-means clustering
X-means clustering is a variation of k-means clustering in which clusters are refined and subdivided repeatedly, keeping the subdivisions that improve the Bayesian Information Criterion (BIC). The number of clusters is thus estimated automatically instead of being taken from the user in the form of K. The variance of the clusters is measured using the following formula (Gajowniczek, Orłowski & Ząbkowski, 2019).

$$\hat{\sigma}^2 = \frac{1}{R - K} \sum_{i} \left( x_i - \mu_{(i)} \right)^2$$

where R and K are the number of points and the number of clusters respectively, and $\mu_{(i)}$ is the centroid of the cluster to which point $x_i$ belongs.
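As an illustration of the variance term, the following sketch assumes the usual X-means estimate, in which the squared distances of all points to their own centroids are summed and divided by R - K. The function name and the toy data are hypothetical, and the BIC computation itself is not shown.

```python
def pooled_variance(points, labels, centroids):
    """Variance estimate: sum of squared distances of each point to the
    centroid of its own cluster, divided by (R - K)."""
    R, K = len(points), len(centroids)
    squared = sum(
        sum((a - b) ** 2 for a, b in zip(p, centroids[l]))
        for p, l in zip(points, labels)
    )
    return squared / (R - K)

points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
labels = [0, 0, 1, 1]
centroids = [(1.0, 1.5), (8.5, 8.0)]
print(pooled_variance(points, labels, centroids))
```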

Random clustering
Random clustering is available in Rapidminer to perform a random flat clustering of a given dataset. The samples are assigned to clusters randomly, and some of the clusters can be empty.
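A random flat clustering of this kind serves mainly as a baseline. The sketch below shows one way it could be written; the function name, the seed parameter and the sample counts are assumptions for illustration and do not reproduce the Rapidminer operator.

```python
import random
from collections import Counter

def random_clustering(n_samples, k, seed=None):
    """Assign each sample to one of k clusters uniformly at random."""
    rng = random.Random(seed)
    return [rng.randrange(k) for _ in range(n_samples)]

assignments = random_clustering(n_samples=10, k=4, seed=42)
sizes = Counter(assignments)
# A cluster that received no samples is reported with size 0.
cluster_sizes = [sizes.get(c, 0) for c in range(4)]
print(assignments, cluster_sizes)
```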

Classification and prediction algorithms
After clustering, classification is performed by the proposed model. Each clustering method is evaluated and the best clustering method is combined with the classification algorithms. The proposed model first uses single classifiers and measures their performance. Then, ensemble classifiers are used along with clustering to attain the highest prediction accuracy. The combination of clustering and ensemble classifier that attains the highest churn prediction accuracy is considered the proposed model. The following classifiers are used by the proposed churn prediction model.

K-nearest neighbor
The k-nearest neighbor is one of the simplest classification methods in data mining. The following distance formula is used to measure the distance:

$$d(x, y) = \sqrt{\sum_{i=1}^{k} (x_i - y_i)^2}$$

where x and y are the instances for which the distance is calculated and k is the number of attributes over which they are compared.
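A nearest-neighbour classifier can be sketched as follows, assuming Euclidean distance and a simple majority vote among the closest training instances. The two attributes (daily minutes and customer service calls), the labels and the function names are hypothetical and chosen only to make the example readable.

```python
import math
from collections import Counter

def euclidean(x, y):
    """Euclidean distance between two instances."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_predict(train_X, train_y, query, n_neighbors=3):
    """Label the query with the majority class of its nearest neighbours."""
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:n_neighbors])
    return votes.most_common(1)[0][0]

# Hypothetical (daily minutes, service calls) pairs with churn labels.
train_X = [(100, 1), (120, 0), (110, 2), (300, 5), (310, 6), (290, 4)]
train_y = ["no", "no", "no", "yes", "yes", "yes"]
print(knn_predict(train_X, train_y, (295, 5)))
print(knn_predict(train_X, train_y, (105, 1)))
```

In practice the attributes would normally be scaled first, since an attribute with a large numeric range otherwise dominates the distance.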

Decision tree
In 1993, Quinlan introduced a divide-and-conquer method termed the decision tree. Entropy and then information gain are calculated for each attribute using the following formulas.

$$H(S) = -\sum_{x \in X} p(x) \log_2 p(x)$$

where S is the dataset, X is the set of classes in S and p(x) is the probability of each class.

$$IG(S, A) = H(S) - \sum_{t \in T} p(t) H(t) = H(S) - H(S \mid A) \tag{6}$$

where H(S) is the entropy of S, T is the set of subsets created by splitting S on attribute A, p(t) is the probability of subset t and H(t) is the entropy of subset t.
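The entropy and information gain calculations can be checked on a small example. The sketch below is an illustration under assumed names: the attributes `plan` and `region` and the four toy customers are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """H(S): entropy of the class labels in S, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """IG(S, A): entropy of S minus the weighted entropy of the subsets
    obtained by splitting S on the given attribute."""
    n = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

rows = [
    {"plan": "intl", "region": "a"},
    {"plan": "intl", "region": "b"},
    {"plan": "none", "region": "a"},
    {"plan": "none", "region": "b"},
]
labels = ["yes", "yes", "no", "no"]
print(entropy(labels))
print(information_gain(rows, labels, "plan"))
print(information_gain(rows, labels, "region"))
```

Here `plan` separates the two classes perfectly and so has the maximum gain of one bit, whereas `region` leaves each subset evenly mixed and has zero gain. A decision tree would therefore split on `plan` first.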

Gradient boosted tree
The idea behind the gradient boosted tree (GBT) is to improve prediction accuracy by producing an ensemble of decision trees. GBT can outperform random forest because it builds the ensemble sequentially from weak prediction models, each of which corrects the errors of its predecessors. The model produces a prediction y for an input x, where ϴ denotes the parameters that best fit the model.

Random forest
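Random forest, or random decision forest, is an ensemble learning method used for classification, regression and other tasks. It operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the average prediction (regression) of the individual trees. Entropy or the Gini index is used for tree construction, using the following formulas.

$$\text{Entropy} = -\sum_{x \in X} p(x) \log_2 p(x)$$

$$\text{Gini} = 1 - \sum_{x \in X} p(x)^2$$

where X is the set of classes and p(x) is the probability of class x.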

Deep learning
Deep learning is a multi-layer technique that comprises a large number of layers of neurons. It is an artificial neural network that is used to solve more complex and difficult data mining problems.

Naïve Bayes
The Naive Bayes method is a supervised learning algorithm based on applying Bayes' theorem. The following formula is used by the Naïve Bayes classifier.

$$P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)}$$

where P(c) is the prior probability of the class, P(x) is the prior probability of the predictor, P(x|c) is the probability of the predictor given the class and P(c|x) is the posterior probability of the class.
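A small categorical example shows how the posterior follows from counts. This is a sketch under stated assumptions: the single attribute (whether the customer has an international plan), the six toy customers and the function names are hypothetical, and no smoothing is applied, so an attribute value that never occurs with a class gives that class a score of zero.

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    priors = Counter(labels)
    likelihoods = defaultdict(Counter)  # (class, attribute index) -> counts
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            likelihoods[(c, i)][v] += 1
    return priors, likelihoods

def posterior(row, priors, likelihoods):
    """P(c|x) for every class, assuming attributes are independent given c.
    P(x) is obtained by summing P(x|c)P(c) over the classes."""
    n = sum(priors.values())
    scores = {}
    for c, count in priors.items():
        score = count / n                      # P(c)
        for i, v in enumerate(row):
            score *= likelihoods[(c, i)][v] / count  # P(x_i | c)
        scores[c] = score
    evidence = sum(scores.values())
    return {c: s / evidence for c, s in scores.items()}

# One attribute: does the customer have an international plan?
rows = [("yes",), ("yes",), ("no",), ("no",), ("no",), ("yes",)]
labels = ["churn", "churn", "stay", "stay", "stay", "stay"]
priors, likelihoods = train_nb(rows, labels)
print(posterior(("yes",), priors, likelihoods))
```

For a customer with an international plan, P(churn) = 2/6 and P(yes | churn) = 1, while P(stay) = 4/6 and P(yes | stay) = 1/4, so the posterior probability of churn is (1/3) / (1/3 + 1/6) = 2/3.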

NB (K) (Naïve Bayes Kernel)
The Naive Bayes (Kernel) operator can be applied on numerical attributes. A kernel is a weighting function used in non-parametric estimation techniques. Kernels are used in kernel density estimation to estimate random variables' density functions, or in kernel regression to estimate the conditional expectation of a random variable.
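Kernel density estimation can be sketched for a single numerical attribute as follows. The Gaussian kernel, the bandwidth values and the sample numbers are assumptions chosen for illustration; the kernel and bandwidth used by the operator in the study are not stated in the text.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the weighting function."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x, samples, bandwidth):
    """Kernel density estimate at x: the average of kernels centred on
    each sample, scaled by the bandwidth."""
    total = sum(gaussian_kernel((x - s) / bandwidth) for s in samples)
    return total / (len(samples) * bandwidth)

# Hypothetical values of one numerical attribute for one class.
samples = [1.0, 1.2, 0.8]
print(kde(1.0, samples, bandwidth=0.5))
print(kde(10.0, samples, bandwidth=0.5))
```

In a kernel-based Naive Bayes classifier an estimate of this kind would stand in for P(x|c) for each numerical attribute, computed separately from the training values of each class.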

Ensemble classifiers
Krawczyk et al. (2017) used ensemble methods to apply multiple learning algorithms for prediction. Ensembles increase the performance of the system or model (Rustam et al., 2020). The following ensemble models are used by the proposed model.

Voting
The voting method combines the results of individual classification algorithms using majority voting. Each individual classifier assigns a class label to the test data; the results are then combined by voting, and the final class prediction is the class that receives the maximum number of votes (Gupta & Chandra, 2020). The following formula is used to apply majority voting on the dataset:

$$\sum_{t=1}^{T} d_{t,J} = \max_{j = 1, \dots, C} \sum_{t=1}^{T} d_{t,j}$$

where T represents the number of classifiers, $d_{t,j}$ is the decision of classifier t for class j (1 if the classifier votes for class j and 0 otherwise), C is the number of classes and J is the class that receives the most votes.
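Majority voting over the class labels produced by several classifiers can be sketched in a few lines. The function name and the example labels are assumptions for illustration; when two classes tie, this sketch returns the class that was seen first.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the class label chosen by the largest number of classifiers."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical labels assigned to one test customer by three classifiers.
print(majority_vote(["churn", "stay", "churn"]))
```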

Bagging
Bagging stands for Bootstrap Aggregation. It is an ensemble classifier that has a bag of similar and dissimilar objects. It helps to decrease the variance of the classifiers used in the prediction model and so improves performance (Brown et al., 2005). The evaluation of Bagging is given as follows:

$$v_{t,j} = \begin{cases} 1, & \text{if } h_t \text{ picks class } w_j \\ 0, & \text{otherwise} \end{cases}$$

where t indexes the bootstrapped training samples, $h_t$ represents the classifier trained on sample t and $w_j$ represents the class labels. Each class will have total votes represented by:

$$V_j = \sum_{t=1}^{T} v_{t,j}, \quad j = 1, \dots, C$$

AdaBoost

AdaBoost, short for Adaptive Boosting, is a meta-algorithm that can be used in conjunction with other learning algorithms to improve their performance (Brown et al., 2005). Weighted majority voting is applied to the classifiers. Every classifier gets an equal opportunity to draw samples in each iteration. The following formula is used to apply weighted majority voting:

$$V_j = \sum_{t:\, h_t(x) = w_j} \log \frac{1}{\beta_t}$$

where $\beta_t$ is the normalized error, $h_t$ represents the trained classifiers and $w_j$ represents the class labels of the training data.
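The difference between the two voting schemes can be seen in a short sketch: bagging counts one vote per classifier, whereas AdaBoost weights each vote by the logarithm of the inverse normalized error. The function names, the example predictions and the error values are assumptions for illustration, and training of the base classifiers is not shown.

```python
import math
import random
from collections import Counter, defaultdict

def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement, as used to train each
    bagged classifier."""
    return [rng.choice(data) for _ in data]

def vote_totals(predictions):
    """Bagging: total votes per class, one vote per classifier."""
    return Counter(predictions)

def weighted_vote(predictions, betas):
    """AdaBoost-style vote: classifier t contributes log(1 / beta_t) to
    the class it predicts. Each beta_t is assumed to lie in (0, 1)."""
    totals = defaultdict(float)
    for label, beta in zip(predictions, betas):
        totals[label] += math.log(1.0 / beta)
    return max(totals, key=totals.get)

predictions = ["stay", "stay", "churn"]
betas = [0.8, 0.8, 0.1]  # hypothetical normalized errors
print(vote_totals(predictions).most_common(1)[0][0])
print(weighted_vote(predictions, betas))
print(bootstrap_sample([1, 2, 3, 4, 5], random.Random(0)))
```

With these values the unweighted vote favours "stay" by two votes to one, but the single low-error classifier outweighs the two high-error ones in the weighted vote, which returns "churn".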

Stacking
Stacking is used for combining different learners rather than selecting among them. It can achieve performance better than that of any single trained model. Bootstrapped samples of the training data are used to train the classifiers. Two types of classifiers are used in stacking: Tier-1 classifiers and Tier-2 classifiers. Tier-1 classifiers are trained on bootstrapped samples and generate predictions; their results are then used to train the Tier-2 classifiers. In this way the training data is fully used for learning (Brown et al., 2005).

Working of proposed model
The proposed model increases churn prediction performance by using a hybrid model in which clustering methods and classification methods are combined. Combinations of different clustering and classification ensembles are introduced, and the best combinations are selected for the final prediction. First, clustering is used to generate the clusters of the given dataset. The "Map clustering on Label" operator is used to assign labels to the data. Classification is then performed on the labelled data to generate the results. The experiments also show that the performance, accuracy and efficiency of the churn prediction model can be increased by using the proposed hybrid models. Because a model based on a single classifier cannot provide high accuracy, the proposed approach uses hybrid models for churn prediction.
1. First, the clustering techniques are evaluated and the best clustering technique is selected on the basis of accuracy.
2. After clustering, single-classifier classification is performed and accuracy, precision, recall and f-measure results are obtained.
3. Next, a hybrid model of the best clustering method and each single classifier is developed, and performance results are obtained for each dataset.
4. Next, models based only on ensemble classifiers are developed and evaluated on both datasets.
5. Then, these ensemble models are combined with the best clustering technique to form hybrid models, and their performance is evaluated. The evaluation shows that the proposed combination of clustering and ensemble models achieves the highest prediction accuracy compared to state-of-the-art models on both churn prediction datasets.
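The step in which cluster assignments are mapped onto class labels can be sketched as follows. This is an assumption about how such a mapping might work, namely assigning each cluster the majority true label of its members; the behaviour of the Rapid Miner operator may differ, and the function names and toy data are hypothetical.

```python
from collections import Counter, defaultdict

def map_clusters_to_labels(cluster_ids, true_labels):
    """Give every cluster the most common true label among its members."""
    members = defaultdict(list)
    for cid, label in zip(cluster_ids, true_labels):
        members[cid].append(label)
    return {cid: Counter(labels).most_common(1)[0][0]
            for cid, labels in members.items()}

def clustering_accuracy(cluster_ids, true_labels):
    """Fraction of samples whose mapped cluster label equals the true label."""
    mapping = map_clusters_to_labels(cluster_ids, true_labels)
    hits = sum(mapping[c] == y for c, y in zip(cluster_ids, true_labels))
    return hits / len(true_labels)

cluster_ids = [0, 0, 0, 1, 1]
true_labels = ["stay", "stay", "churn", "churn", "churn"]
print(map_clusters_to_labels(cluster_ids, true_labels))
print(clustering_accuracy(cluster_ids, true_labels))
```

An accuracy computed in this way is what allows clustering methods to be ranked in step 1 before the best one is combined with the classifiers.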

EXPERIMENTAL RESULTS, EVALUATION AND DISCUSSION
The experimental evaluation of the proposed model is performed on two benchmark churn prediction datasets. First, the clustering techniques are applied to each dataset and the best clustering method is selected. Next, single classifiers are applied to each dataset and their performance is evaluated, as shown in "Map Clustering on Label". Then, the single classifiers are evaluated along with K-Medoid clustering and their performance is again evaluated for each dataset, as shown in "Clustering with Single Classifier on Churn Prediction Datasets". The analysis shows that the performance of the single classifiers improves. After that, ensembles (Voting, Bagging, Stacking, AdaBoost) are evaluated on each dataset along with K-Medoid clustering, as shown in "Clustering with Ensembles on Churn Prediction Datasets". The analysis indicates that the AdaBoost ensemble combined with clustering performs better than the other ensembles on both churn prediction datasets. The following datasets, which are freely available from online data repositories, are used for the experiments and evaluation.

GitHub dataset
The first churn prediction dataset is collected from GitHub, an online data repository whose datasets are freely available for research. The dataset name is "Kaggletelecom-customer-churn-prediction", obtained from the data source https://www.kaggle.com/blastchar/telco-customer-churn. It is used to predict customers' behaviour in order to retain them. It contains 5,000 instances, where each row represents a customer and the columns represent the customer's attributes. The dataset contains 707 churn customers and 4,293 non-churn customers.

BigML dataset
The second dataset is collected from the Bigml platform, which is also a freely available online data repository. The dataset is obtained from the data source https://cleverdata.io/en/bigdatapredictions-bigml/. The name of the dataset is "Churn in Telecom's dataset". It contains 3,333 instances with 21 attributes. There are 483 churn and 2,850 non-churn customers in the dataset. This dataset is also used to predict customers' behaviour.

Rapid miner
Rapid Miner is a data science software platform that provides an integrated environment for data preparation, machine learning, deep learning, text mining, and predictive analytics. The proposed research is implemented using Rapid Miner. It is also freely available on the web.

Model evaluation
A confusion matrix is used for model evaluation, and with its help the performance of the proposed model is analysed using accuracy, precision, recall and f-measure (Jamil et al., 2021; Rustam et al., 2021). These measures are calculated with the following formulae, where TP represents true positives, TN true negatives, FP false positives and FN false negatives.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

Table 2 shows the clustering evaluation results on the GitHub and Bigml datasets.
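The four evaluation measures can be computed directly from the confusion matrix counts, as in the following sketch. The function name and the counts are hypothetical and do not correspond to any result reported in this paper; the sketch also assumes that the denominators are non-zero.

```python
def evaluate(tp, tn, fp, fn):
    """Accuracy, precision, recall and f-measure from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }

# Hypothetical counts, with churn treated as the positive class.
metrics = evaluate(tp=40, tn=45, fp=5, fn=10)
print(metrics)
```

With these counts the accuracy is 0.85, the precision is 40/45, the recall is 0.8 and the f-measure is 80/95, or about 0.842.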

Clustering with single classifier on churn prediction datasets
The literature shows that single classification techniques achieve lower classification accuracy than hybrid models; therefore, supervised and unsupervised techniques are combined to generate a hybrid model, which is then used for classification in order to increase accuracy. Table 2 shows that k-med achieves higher accuracy than the other clustering techniques; therefore, the combination of k-med with seven different classification algorithms (GBT, DT, RF, kNN, DL, NB, NB(K)) is applied one by one to each dataset, as shown in Table 3.

Clustering with ensembles on churn prediction datasets
Tables 4-7 show the results when k-med clustering is combined with different combinations of classifiers. Voting, Bagging, Stacking and AdaBoost ensembles are used. The combination of GBT, DT and DL shows the highest accuracy when it is combined with k-med clustering.

Comparison of different techniques
A comparison of all techniques has been carried out. The comparison shows different levels of accuracy for the different hybrid models. The average accuracy of the different techniques has been compared, and Table 8 shows the comparison results. The experiments show that the results improve at each step because a hybrid approach is used.

Comparison with state of the art techniques
Tables 9 and 10 show the comparison of the proposed model with different state-of-the-art techniques. The proposed model shows higher accuracy than the existing techniques. In this research, hybrid models of supervised and unsupervised learning techniques are proposed and implemented with Rapid Miner. These models are applied to two datasets that are freely available from online data repositories. In the first step, four clustering algorithms (k-means, K-medoid, X-means and random clustering) are selected for experimentation. The evaluation of these clustering algorithms shows that K-medoid achieves higher accuracy than the other three, so K-medoid is selected for hybrid model generation. The next step is the selection of classification algorithms. Seven classification algorithms (GBT, DT, RF, KNN, DL, NB and NB (K)) are selected on the basis of the literature, as they perform well on churn prediction datasets. The evaluation shows that GBT achieves high accuracy. After these separate experiments, a hybrid model of clustering and a single classification algorithm is developed. In this experiment, the combination of K-med and GBT shows better accuracy than the other combinations. Next, a hybrid model of classification algorithms is implemented and experiments are performed. The main reason for implementing a hybrid model is that it shows better accuracy than single classifiers (Khairandish et al., 2021; Sujatha et al., 2021), so different combinations of the above-mentioned classifiers are used for experimentation. These hybrid classifiers are used with k-med clustering for better results. After this step, the ensemble classifiers voting, bagging, AdaBoost and stacking are used with the hybrid model of clustering and classification. These ensemble classifiers are used to increase accuracy. With the combination of ensemble classifiers and a clustering algorithm, the models show better accuracy for churn prediction.

CONCLUSION AND FUTURE WORK
Customer churn prediction is a critical problem for telecom companies. Identifying customers who are unhappy with the services provided allows companies to work on their weak points, pricing plans, promotions and customer preferences in order to reduce the reasons for churn. Many techniques are used in the literature to predict customer churn. The proposed research focused on introducing different models for customer churn prediction with high accuracy. Novel hybrid models were introduced by combining clustering and classification approaches. The proposed models were then evaluated on two churn prediction datasets obtained from online data repositories. The analysis of the results shows that the proposed models achieve higher classification and prediction accuracy than existing state-of-the-art models. In this work the combination of k-med clustering with the GBT, DT and DL classifier ensemble shows higher accuracy than the other methods. This research can be extended in future by using big data analytics. Social network analysis can be used to identify customers' satisfaction with telecom services, and suitable services can then be offered to reduce the churn rate. Further datasets can also be used to increase confidence in the results. Finally, the models can be applied to other sectors such as banking, insurance or airlines, and the prediction accuracy can be compared.

ADDITIONAL INFORMATION AND DECLARATIONS

Funding
The authors received no funding for this work.