
Classification optimization for training a large dataset with Naïve Bayes


Abstract

Book classification is widely used in digital libraries, and book rating prediction is crucial for improving services to readers. Commonly used techniques include decision trees, Naïve Bayes (NB), and neural networks. Moreover, mining book data depends on feature selection, data pre-processing, and data preparation. This paper proposes knowledge representation optimization and feature selection solutions to enhance book classification, and identifies appropriate classification algorithms. Several experiments were conducted, and they show that NB can provide the best prediction results. By applying appropriate strategies for feature selection, data type selection, and data transformation, the accuracy and performance of NB can be improved so that it outperforms other classification algorithms.
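The paper's actual pipeline is not reproduced here, but as a rough illustration of the idea the abstract describes (a Naïve Bayes classifier whose accuracy is helped by feature selection and data transformation), a minimal sketch in Python with scikit-learn might look like the following. The library choice, the file name book_ratings.csv, and the column names are assumptions for illustration only, not the authors' setup.

```python
# Hypothetical sketch of the approach described in the abstract: Naive Bayes
# combined with data transformation and feature selection. This is not the
# authors' pipeline; the dataset file and column names are made up.
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical dataset: numeric or already-encoded book features plus a rating class.
data = pd.read_csv("book_ratings.csv")
X = data.drop(columns=["rating_class"])
y = data["rating_class"]

pipeline = Pipeline([
    ("scale", MinMaxScaler()),            # data transformation: rescale features to [0, 1]
    ("select", SelectKBest(chi2, k=10)),  # feature selection: keep the 10 highest-scoring features
    ("nb", MultinomialNB()),              # Naive Bayes classifier
])

# Estimate accuracy with 10-fold cross-validation.
scores = cross_val_score(pipeline, X, y, cv=10, scoring="accuracy")
print(f"Mean accuracy: {scores.mean():.3f}")
```

Comparing the cross-validated accuracy of such a pipeline against runs that omit the scaling or selection steps is one way to gauge how much each strategy contributes.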



Acknowledgements

This research is funded by Vietnam National Foundation for Science and Technology Development (NAFOSTED) under Grant Number: 06/2018/TN.

Author information

Corresponding author

Correspondence to Thi Thanh Sang Nguyen.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article

Nguyen, T.T.S., Do, P.M.T. Classification optimization for training a large dataset with Naïve Bayes. J Comb Optim 40, 141–169 (2020). https://doi.org/10.1007/s10878-020-00578-0

