Abstract
We present a general-purpose multi-stage, multi-group machine learning framework that combines the discriminant analysis via mixed integer programming (DAMIP) classifier with an exact combinatorial branch-and-bound (BB) algorithm and a fast particle swarm optimization (PSO) heuristic for feature selection. DAMIP delays decisions on ‘difficult-to-classify’ observations by placing them into a reserved judgment region and develops new classification rules for them in a later stage. This design is well suited to poorly separated data, which are difficult to classify without incurring a high percentage of misclassification errors. The model's misclassification limits and reserved judgment levels can be fine-tuned to manage imbalanced groups efficiently, so that minority groups (with relatively few entities) are treated on an equal footing with majority groups. We tackle four medical problems that involve poorly separated data and imbalanced groups, and for which traditional classifiers yield low prediction accuracy: multi-site treatment outcome prediction for best practice discovery in (a) cardiovascular disease and (b) diabetes; (c) early disease diagnosis, classifying subjects into normal cognition, mild cognitive impairment, and Alzheimer’s disease groups using neuropsychological tests and blood plasma biomarkers; and (d) uncovering patient characteristics that predict optimal response to intra-articular injections of hyaluronic acid for knee osteoarthritis. The multi-stage BB-PSO/DAMIP returns interpretable predictive results with over 80% blind prediction accuracy. One advantage of our findings is that the identified features are easily interpreted and understood by clinicians as well as patients, which can have a significant impact on translating the findings into clinical practice to improve quality of life and medical outcomes.
The multiple rules with relatively small subsets of discriminatory features afford flexibility for different sites (and different patient populations) to adopt different policies for implementing the best practice.
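The reserved-judgment idea described in the abstract can be illustrated with a small sketch. This is not DAMIP: the mixed integer programming classifier is replaced here by a plain nearest-centroid rule, and the `margin` threshold, the function names, and the toy data are all assumptions made for illustration. The sketch only shows the control flow, in which each stage classifies what it can, defers ambiguous observations, and a later stage fits a new rule on the deferred ones.

```python
import math
from collections import defaultdict

RESERVED = None  # hypothetical marker: decision deferred to a later stage


def fit_centroids(X, y):
    """Return {label: centroid}; a simple stand-in for the real classifier."""
    sums, counts = {}, defaultdict(int)
    for x, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(x)
        sums[label] = [s + v for s, v in zip(sums[label], x)]
        counts[label] += 1
    return {g: [s / counts[g] for s in sums[g]] for g in sums}


def classify_or_reserve(centroids, x, margin):
    """Assign x to its nearest centroid, or reserve judgment when the two
    closest centroids are within `margin` of each other."""
    dists = sorted((math.dist(x, c), g) for g, c in centroids.items())
    if len(dists) > 1 and dists[1][0] - dists[0][0] < margin:
        return RESERVED
    return dists[0][1]


def fit_multistage(X, y, margins):
    """Fit one rule per stage; each later stage is trained only on the
    training observations the previous stage placed in the reserved region."""
    stages = []
    for margin in margins:
        if not X:
            break
        centroids = fit_centroids(X, y)
        stages.append((centroids, margin))
        kept = [(x, g) for x, g in zip(X, y)
                if classify_or_reserve(centroids, x, margin) is RESERVED]
        X, y = [x for x, _ in kept], [g for _, g in kept]
    return stages


def predict_multistage(stages, x):
    """Return the label from the first stage that does not reserve judgment."""
    for centroids, margin in stages:
        label = classify_or_reserve(centroids, x, margin)
        if label is not RESERVED:
            return label
    return RESERVED


# Toy data: two well-separated clusters plus one hard point per group.
X = [(0, 0), (0, 1), (1, 0), (2.4, 2.4), (5, 5), (5, 4), (4, 5), (2.6, 2.6)]
y = ["A", "A", "A", "A", "B", "B", "B", "B"]

# Stage 1 reserves anything with margin below 1.0; stage 2 (margin 0.0) always decides.
stages = fit_multistage(X, y, margins=[1.0, 0.0])
easy = predict_multistage(stages, (0.5, 0.5))
hard = predict_multistage(stages, (2.45, 2.45))
```

In this sketch the per-stage `margin` loosely plays the role that the abstract assigns to reserved judgment levels: a larger value defers more observations to later stages, and a final value of 0.0 forces a decision.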
Acknowledgements
A portion of the results from this project (the machine learning advances and the results obtained for cardiovascular disease and diabetes) received the first runner-up prize at the 2019 Caterpillar and INFORMS Innovative Applications in Analytics Award. This work is partially supported by grants from the National Science Foundation (IIP-1361532) and the American Orthopedic Society for Sports Medicine. Findings and conclusions in this paper are those of the authors and do not necessarily reflect the views of the National Science Foundation or the American Orthopedic Society for Sports Medicine. The authors would like to acknowledge the participation of Zhuonan Li in this project. The authors also thank Dr. Allan Levey, Dr. Felicia Goldstein, and Dr. William Hu of Emory Alzheimer’s Disease Research Center for their collaboration and clinical advice. The authors extend their deepest respect and gratitude to the late Barton J. Mann, PhD, with whom we collaborated on the knee osteoarthritis research, and to Dr. Captain Marlene DeMaio for her clinical guidance and collaboration on the project. We thank the anonymous reviewers for their useful comments.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Lee, E.K., Yuan, F., Mann, B.J., Egan, B. (2023). A General-Purpose Multi-stage Multi-group Machine Learning Framework for Knowledge Discovery and Decision Support. In: Coenen, F., et al. Knowledge Discovery, Knowledge Engineering and Knowledge Management. IC3K 2022. Communications in Computer and Information Science, vol 1842. Springer, Cham. https://doi.org/10.1007/978-3-031-43471-6_4
DOI: https://doi.org/10.1007/978-3-031-43471-6_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43470-9
Online ISBN: 978-3-031-43471-6
eBook Packages: Computer Science, Computer Science (R0)