Abstract
The exponential growth of hospital information systems (HIS) has led to the accumulation of vast amounts of medical data, which must be analyzed effectively to improve the quality and efficiency of medical services. Machine learning has emerged as a valuable technology for the automated and accurate analysis of medical data, with potential applications in disease diagnosis and treatment. This study aims to advance classification methods and to address data imbalance in hematological data. Specifically, we propose an efficient algorithm for disease classification from hemogram blood test samples that combines the random forest algorithm with the synthetic minority oversampling technique (SMOTE). Experimental results on real hematological data from a local hospital demonstrate the superiority of the proposed method, which achieves an accuracy of up to 97.75% and an area under the curve (AUC) of up to 98.65%. The findings underscore the value of machine learning techniques for diagnosis and treatment in clinical practice, especially when integrated into HIS.
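The oversampling idea named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `smote`, its parameters, and the toy feature values are assumptions made for illustration, and only the interpolation step of SMOTE is shown. In a full pipeline, a random forest classifier would then be trained on the balanced data.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Made-up values standing in for two hemogram features (hypothetical data).
minority = [(11.2, 3.9), (10.8, 4.1), (11.5, 4.0), (10.5, 3.7)]
majority_count = 10  # assumed size of the majority class

# Generate enough synthetic samples to match the majority class size.
new_points = smote(minority, n_new=majority_count - len(minority), k=3)
balanced_minority = minority + new_points
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the region already occupied by the minority class instead of duplicating existing records.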
Acknowledgment
This research is funded by Vietnam National University Ho Chi Minh City (VNU-HCM) under grant number C2024-16-02.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Huynh, PH., Nguyen, NM., Tran, TN., Doan, TN. (2024). Improvements in the Imbalanced Hemogram Data Classification. In: Triwiyanto, T., Rizal, A., Caesarendra, W. (eds) Proceedings of the 4th International Conference on Electronics, Biomedical Engineering, and Health Informatics. ICEBEHI 2023. Lecture Notes in Electrical Engineering, vol 1182. Springer, Singapore. https://doi.org/10.1007/978-981-97-1463-6_23
DOI: https://doi.org/10.1007/978-981-97-1463-6_23
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-1462-9
Online ISBN: 978-981-97-1463-6
eBook Packages: Engineering, Engineering (R0)