Breast tumor prediction and feature importance score finding using machine learning algorithms

Sk. Shalauddin Kabir, Md. Sabbir Ahmmed, Md. Moradul Siddique, Romana Rahman Ema, Motiur Rahman, Syed Md. Galib

Abstract


This study addresses breast tumor prediction and feature importance scoring using machine learning algorithms. The goal was to develop an accurate predictive model for identifying breast tumors and to determine the importance of various features in the prediction process. The tasks undertaken were: collecting and preprocessing the Wisconsin Breast Cancer original dataset (WBCD); dividing the dataset into training and testing sets; training models with machine learning algorithms, namely Random Forest, Decision Tree (DT), Logistic Regression, Multi-Layer Perceptron, Gradient Boosting Classifier (GBC), and K-Nearest Neighbors; evaluating the models using performance metrics; and calculating feature importance scores. The methods used involve data collection, preprocessing, model training, and evaluation. The results showed that the Random Forest model is the most reliable predictor, with 98.56% accuracy. The dataset contained 699 instances in total, which data optimization methods reduced to 461 instances. In addition, we ranked the top features of the dataset by feature importance score to determine how they affect the classification models. Furthermore, the models were subjected to 10-fold cross-validation for performance analysis and comparison. The conclusions drawn from this study highlight the effectiveness of machine learning algorithms in breast tumor prediction, achieving high accuracy and robust performance metrics. In addition, the analysis of feature importance scores provides valuable insights into the key indicators of breast cancer development. These findings contribute to the field of breast cancer diagnosis and prediction by enhancing early detection and personalized treatment strategies and improving patient outcomes.
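
The workflow summarized in the abstract (a train/test split, 10-fold cross-validation, and a feature importance ranking) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes synthetic two-feature data in place of WBCD, a hand-written k-nearest-neighbors classifier, and permutation importance as a stand-in scoring method, since the abstract does not state how the importance scores were computed. All function names are hypothetical.

```python
import random
from collections import Counter


def make_toy_data(n=200, seed=0):
    """Synthetic stand-in for a tumor dataset (NOT WBCD).

    Feature 0 separates the classes; feature 1 is pure noise.
    """
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2  # toy labels: 0 = benign, 1 = malignant
        X.append([rng.gauss(3.0 * label, 1.0), rng.gauss(0.0, 1.0)])
        y.append(label)
    return X, y


def knn_predict(train_X, train_y, row, k=5):
    """Majority vote among the k training rows closest to `row`."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(t, row)), label)
        for t, label in zip(train_X, train_y)
    )
    return Counter(label for _, label in dists[:k]).most_common(1)[0][0]


def accuracy(train_X, train_y, test_X, test_y, k=5):
    """Fraction of test rows whose predicted label matches the true label."""
    hits = sum(knn_predict(train_X, train_y, r, k) == t
               for r, t in zip(test_X, test_y))
    return hits / len(test_y)


def k_fold_indices(n, n_folds=10, seed=0):
    """Shuffle row indices and deal them into n_folds disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_folds] for i in range(n_folds)]


def cross_val_accuracy(X, y, n_folds=10, k=5):
    """Mean accuracy over folds; each fold is held out exactly once."""
    scores = []
    for fold in k_fold_indices(len(X), n_folds):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        scores.append(accuracy([X[i] for i in train], [y[i] for i in train],
                               [X[i] for i in fold], [y[i] for i in fold], k))
    return sum(scores) / len(scores)


def permutation_importance(train_X, train_y, test_X, test_y, feature, seed=1):
    """Accuracy lost when one feature column of the test set is shuffled."""
    column = [row[feature] for row in test_X]
    random.Random(seed).shuffle(column)
    shuffled = [row[:feature] + [v] + row[feature + 1:]
                for row, v in zip(test_X, column)]
    return (accuracy(train_X, train_y, test_X, test_y)
            - accuracy(train_X, train_y, shuffled, test_y))


X, y = make_toy_data()
cv_accuracy = cross_val_accuracy(X, y)

# Hold-out split (75% train, 25% test) used for the importance scores.
order = k_fold_indices(len(X), n_folds=4, seed=42)
test_rows = order[0]
train_rows = [i for fold in order[1:] for i in fold]
train_X, train_y = [X[i] for i in train_rows], [y[i] for i in train_rows]
test_X, test_y = [X[i] for i in test_rows], [y[i] for i in test_rows]

importances = {
    f: permutation_importance(train_X, train_y, test_X, test_y, f)
    for f in range(2)
}
ranking = sorted(importances, key=importances.get, reverse=True)
print(f"10-fold CV accuracy: {cv_accuracy:.3f}")
print("features ranked by importance:", ranking)
```

On this toy data, the informative feature should rank above the noise feature, because shuffling it removes the signal the classifier relies on. With real data, the same pattern applies per feature column, and a library implementation would normally replace the hand-written helpers.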

Keywords


Breast tumor; Benign; Classification model; Machine learning; Tumor; Malignant; Data optimization

DOI: https://doi.org/10.32620/reks.2023.4.03
