A novel discretization algorithm based on multi-scale and information entropy

Applied Intelligence

Abstract

Discretization is a data preprocessing step in data mining and is critical to the efficiency and quality of the mining process. Multi-scale analysis reveals the structure and hierarchical characteristics of data objects: dividing a research object into a reasonable hierarchy yields representations of the data at different granularities. This paper introduces multi-scale theory into data discretization and proposes MSE, a discretization method based on multi-scale and information entropy. MSE first performs a scale partition on the attribute domain to obtain sets of candidate cut points at different granularities. It then computes the information entropy of each candidate cut point, selects the candidates in order of minimum entropy, and tests each with the MDLPC criterion to determine the final cut point set. Traditional discretization methods consider only the attribute values observed in the data, which restricts the candidate cut points to those values. MSE avoids this restriction and reduces the number of candidates by keeping the data division hierarchy within an optimal range. Extensive experiments show that MSE achieves high discretization efficiency and classification accuracy, especially when applied to support vector machines, random forests, and decision trees.
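The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how the scale partition is built, so the dyadic equal-width grid in `multiscale_candidates` is an assumption, as are all function names. The acceptance test follows the standard MDLP criterion of Fayyad and Irani, which is assumed here to be what the abstract calls MDLPC.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def multiscale_candidates(lo, hi, levels):
    """Assumed scale partition: level k splits [lo, hi] into 2**k equal bins.

    The paper's actual partition scheme is not given in the abstract.
    """
    cuts = set()
    for k in range(1, levels + 1):
        parts = 2 ** k
        cuts.update(lo + (hi - lo) * i / parts for i in range(1, parts))
    return sorted(cuts)


def mdlp_accepts(labels, left, right):
    """Fayyad-Irani MDLP test: keep a cut only if its gain exceeds the coding cost."""
    n = len(labels)
    ent, ent_l, ent_r = entropy(labels), entropy(left), entropy(right)
    gain = ent - (len(left) / n) * ent_l - (len(right) / n) * ent_r
    k, k_l, k_r = len(set(labels)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (k * ent - k_l * ent_l - k_r * ent_r)
    return gain > (math.log2(n - 1) + delta) / n


def discretize(pairs, candidates):
    """Recursively pick the minimum-entropy candidate cut; stop when MDLP rejects it.

    `pairs` is a list of (value, label) tuples for one continuous attribute.
    """
    labels = [y for _, y in pairs]
    best = None
    for cut in candidates:
        left = [y for x, y in pairs if x <= cut]
        right = [y for x, y in pairs if x > cut]
        if not left or not right:
            continue  # a cut that leaves one side empty separates nothing
        # Class information entropy of the partition induced by this cut.
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if best is None or score < best[0]:
            best = (score, cut, left, right)
    if best is None:
        return []
    _, cut, left, right = best
    if not mdlp_accepts(labels, left, right):
        return []
    lower = discretize([p for p in pairs if p[0] <= cut],
                       [c for c in candidates if c < cut])
    upper = discretize([p for p in pairs if p[0] > cut],
                       [c for c in candidates if c > cut])
    return lower + [cut] + upper
```

For example, with values 1, 2, 3, 4 labelled "a" and 6, 7, 8, 9 labelled "b", `multiscale_candidates(0, 10, 2)` yields the candidates 2.5, 5.0 and 7.5, and `discretize` keeps only the cut at 5.0. Note that 5.0 is not an observed attribute value, which is the kind of candidate a grid-based partition can offer and a value-based one cannot.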

Funding

This work is supported by the National Natural Science Foundation of P. R. China (Nos. 61602335 and 61876122), the Natural Science Foundation of Shanxi Province, P. R. China (No. 201901D211302), the Taiyuan University of Science and Technology Scientific Research Initial Funding of Shanxi Province, P. R. China (No. 20172017), and the Scientific and Technological Innovation Team of Shanxi Province, P. R. China (No. 201805D131007).

Author information

Corresponding author

Correspondence to Yaling Xun.

Ethics declarations

Conflict of interests

The authors declare that they have no conflict of interest.

Additional information

Ethical approval

This article does not contain any studies with human participants performed by any of the authors.

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Xun, Y., Yin, Q., Zhang, J. et al. A novel discretization algorithm based on multi-scale and information entropy. Appl Intell 51, 991–1009 (2021). https://doi.org/10.1007/s10489-020-01850-w

  • DOI: https://doi.org/10.1007/s10489-020-01850-w
