The number of clusters in hybrid predictive models: does it really matter?

Mariusz Łapczyński; Bartłomiej Jefmański

doi:10.5604/01.3001.0013.9131

Mariusz Łapczyński, Cracow University of Economics, College of Management and Quality Sciences, Department of Market Analysis and Marketing Research. ORCID: https://orcid.org/0000-0002-4508-7264

Bartłomiej Jefmański, Wroclaw University of Economics and Business, Faculty of Economics and Finance, Department of Econometrics and Informatics. ORCID: https://orcid.org/0000-0002-0335-0036

Przegląd Statystyczny. Statistical Review, vol. 66, 2019, 3, pages 228-238. Published online: 17 March 2020.

ABSTRACT

Researchers have long combined analytical tools to build predictive models. The combined tools can be of the same type (ensemble models, committees) or of different types (hybrid models). Hybrid models are used in areas such as customer relationship management (CRM), web usage mining, medical sciences, petroleum geology and anomaly detection in computer networks. Our hybrid model is a sequential combination of cluster analysis and decision trees. In the first step of the procedure, objects were grouped into clusters using the k-means algorithm. In the second step, a decision tree model was built with an additional independent variable indicating the cluster to which each object belonged. The analysis was based on 14 data sets collected from publicly accessible repositories. Model performance was assessed with measures derived from the confusion matrix, including accuracy, precision, recall, F-measure, and the lift in the first and second decile. We looked for a relationship between the number of clusters and the quality of hybrid predictive models. To our knowledge, no similar study has been conducted so far. Our research demonstrates that in some cases building hybrid models can improve predictive performance. The models with the highest performance measures required a relatively large number of clusters (from 9 to 15).
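The two-step procedure and the evaluation measures named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the toy data, the centroid seeding rule and the non-cumulative definition of decile lift are all assumptions made for illustration. The decision tree itself is left out: the augmented rows would be passed to whichever tree learner is in use.

```python
from math import dist


def kmeans_labels(rows, k, iters=20):
    """Assign each row to one of k clusters with Lloyd-style iterations."""
    # Assumption: the first k rows seed the centroids (keeps the toy run deterministic).
    centroids = [list(r) for r in rows[:k]]
    labels = [0] * len(rows)
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda c: dist(r, centroids[c])) for r in rows]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def add_cluster_feature(rows, k):
    """Step 1 hand-off: append the cluster index as a new independent variable."""
    return [list(r) + [lab] for r, lab in zip(rows, kmeans_labels(rows, k))]


def confusion_measures(y_true, y_pred):
    """Accuracy, precision, recall and F-measure for a binary (0/1) target."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f_measure": f_measure}


def decile_lift(y_true, scores, decile):
    """Response rate inside one score-ranked decile divided by the overall rate.

    Assumption: non-cumulative lift, at least 10 cases, at least one positive.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    size = len(order) // 10
    chosen = order[(decile - 1) * size: decile * size]
    decile_rate = sum(y_true[i] for i in chosen) / len(chosen)
    return decile_rate / (sum(y_true) / len(y_true))


# Toy data (hypothetical): two well-separated groups of points.
rows = [[1.0, 1.0], [5.0, 5.0], [1.2, 0.8], [5.1, 4.9], [0.9, 1.1], [4.8, 5.2]]
augmented = add_cluster_feature(rows, k=2)  # each row gains a cluster column

# Hypothetical labels, predictions and scores from some fitted classifier.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]
scores = [0.9, 0.2, 0.4, 0.8, 0.1, 0.7, 0.6, 0.3, 0.05, 0.15]
measures = confusion_measures(y_true, y_pred)
lift_1 = decile_lift(y_true, scores, decile=1)
lift_2 = decile_lift(y_true, scores, decile=2)
```

Repeating the run for several values of k and comparing the resulting measures is one way to examine how the number of clusters relates to model quality, which is the question the study poses.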

KEYWORDS

hybrid predictive model, k-means algorithm, decision trees

JEL

C10, C18, C52
