A General-Purpose Multi-stage Multi-group Machine Learning Framework for Knowledge Discovery and Decision Support

  • Conference paper
Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2022)

Abstract

We present a general-purpose multi-stage, multi-group machine learning framework that combines the discriminant analysis via mixed integer programming (DAMIP) classifier with an exact combinatorial branch-and-bound (BB) algorithm and a fast particle swarm optimization (PSO) algorithm for feature selection. DAMIP delays decisions on ‘difficult-to-classify’ observations by placing them in a reserved judgment region, and new classification rules are developed for them in a later stage. This design is well suited to poorly separated data that are difficult to classify without incurring a high percentage of misclassification errors. The misclassification limits and reserved judgment levels of the model can be fine-tuned to manage imbalanced groups efficiently, so that minority groups (with relatively few entities) are treated on an equal footing with majority groups. We tackle four medical problems that involve poorly separated data and imbalanced groups, and for which traditional classifiers yield low prediction accuracy: (a) multi-site treatment outcome prediction for best practice discovery in cardiovascular disease; (b) the same prediction task in diabetes; (c) early disease diagnosis, classifying subjects into normal cognition, mild cognitive impairment, and Alzheimer’s disease groups using neuropsychological tests and blood plasma biomarkers; and (d) uncovering patient characteristics that predict optimal response to intra-articular injections of hyaluronic acid for knee osteoarthritis. The multi-stage BB-PSO/DAMIP returns interpretable predictive results with over 80% blind prediction accuracy. One advantage of our findings is that the identified features are easily interpreted and understood by clinicians and patients alike, which can help translate the findings into clinical practice and improve quality of life and medical outcomes.
The multiple rules with relatively small subsets of discriminatory features afford flexibility for different sites (and different patient populations) to adopt different policies for implementing the best practice.
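
The reserved-judgment idea described in the abstract can be illustrated with a minimal sketch. This is not DAMIP and not the authors' implementation: a nearest-centroid rule stands in for the mixed integer programming classifier, and a hypothetical `margin` threshold stands in for the reserved judgment region. All function names are assumptions made for illustration. Each stage classifies the observations it is confident about and passes the rest to the next stage, which fits a new rule using only the training points that the previous stage reserved.

```python
from math import dist

RESERVED = None  # marker for "reserved judgment" (hypothetical convention)

def centroids(points, labels):
    """Mean vector of each group."""
    sums, counts = {}, {}
    for p, g in zip(points, labels):
        acc = sums.setdefault(g, [0.0] * len(p))
        for i, v in enumerate(p):
            acc[i] += v
        counts[g] = counts.get(g, 0) + 1
    return {g: tuple(v / counts[g] for v in acc) for g, acc in sums.items()}

def classify(point, cents, margin):
    """Nearest group, or RESERVED when the two closest groups differ by less than margin."""
    ranked = sorted((dist(point, c), g) for g, c in cents.items())
    if len(ranked) > 1 and ranked[1][0] - ranked[0][0] < margin:
        return RESERVED
    return ranked[0][1]

def fit_stages(points, labels, margins):
    """Fit one rule per stage; a later stage trains only on points reserved so far."""
    stages = []
    for margin in margins:
        if not points:
            break
        cents = centroids(points, labels)
        stages.append((cents, margin))
        kept = [(p, g) for p, g in zip(points, labels)
                if classify(p, cents, margin) is RESERVED]
        points = [p for p, _ in kept]
        labels = [g for _, g in kept]
    return stages

def predict(point, stages):
    """Return the first confident label across stages, else RESERVED."""
    for cents, margin in stages:
        label = classify(point, cents, margin)
        if label is not RESERVED:
            return label
    return RESERVED

# Toy one-feature data with two poorly separated groups.
pts = [(0.0,), (1.0,), (4.0,), (10.0,), (9.0,), (6.0,)]
labs = ["A", "A", "A", "B", "B", "B"]
stages = fit_stages(pts, labs, margins=[3.0, 0.0])
print(predict((4.5,), stages))
```

In this toy setting a larger `margin` reserves more observations in a given stage, which loosely mirrors how the reserved judgment level trades early commitment against misclassification. The framework described in the abstract controls this through misclassification limits inside an optimization model rather than a distance threshold.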

Acknowledgements

A portion of the results from this project (the machine learning advances and the results obtained for cardiovascular disease and diabetes) received the first runner-up prize at the 2019 Caterpillar and INFORMS Innovative Applications in Analytics Award. This work is partially supported by grants from the National Science Foundation (IIP-1361532) and the American Orthopedic Society for Sports Medicine. Findings and conclusions in this paper are those of the authors and do not necessarily reflect the views of the National Science Foundation or the American Orthopedic Society for Sports Medicine. The authors would like to acknowledge the participation of Zhuonan Li in this project. The authors also thank Dr. Allan Levey, Dr. Felicia Goldstein, and Dr. William Hu of Emory Alzheimer’s Disease Research Center for their collaboration and clinical advice. The authors extend their deepest respect and gratitude to the late Dr. Barton J. Mann, with whom we collaborated on the knee osteoarthritis research, and to Dr. Captain Marlene DeMaio for her clinical guidance and collaboration on the project. We thank the anonymous reviewers for their useful comments.

Author information

Corresponding author

Correspondence to Eva K. Lee.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Lee, E.K., Yuan, F., Man, B.J., Egan, B. (2023). A General-Purpose Multi-stage Multi-group Machine Learning Framework for Knowledge Discovery and Decision Support. In: Coenen, F., et al. Knowledge Discovery, Knowledge Engineering and Knowledge Management. IC3K 2022. Communications in Computer and Information Science, vol 1842. Springer, Cham. https://doi.org/10.1007/978-3-031-43471-6_4

  • DOI: https://doi.org/10.1007/978-3-031-43471-6_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-43470-9

  • Online ISBN: 978-3-031-43471-6

  • eBook Packages: Computer Science (R0)
