Review
Predictive data mining in clinical medicine: Current issues and guidelines

https://doi.org/10.1016/j.ijmedinf.2006.11.006

Abstract

Background

The widespread availability of new computational methods and tools for data analysis and predictive modeling requires medical informatics researchers and practitioners to systematically select the most appropriate strategy to cope with clinical prediction problems. In particular, the collection of methods known as ‘data mining’ offers methodological and technical solutions to deal with the analysis of medical data and construction of prediction models. A large variety of these methods requires general and simple guidelines that may help practitioners in the appropriate selection of data mining tools, construction and validation of predictive models, along with the dissemination of predictive models within clinical environments.

Purpose

The goal of this review is to discuss the extent and role of the research area of predictive data mining and to propose a framework to cope with the problems of constructing, assessing and exploiting data mining models in clinical medicine.

Methods

We review the recent relevant work published in the area of predictive data mining in clinical medicine, highlighting critical issues and summarizing the approaches in a set of learned lessons.

Results

The paper provides a comprehensive review of the state of the art of predictive data mining in clinical medicine and gives guidelines to carry out data mining studies in this field.

Conclusions

Predictive data mining is becoming an essential instrument for researchers and clinical practitioners in medicine. Understanding the main issues underlying these methods and applying agreed, standardized procedures are mandatory for their deployment and for the dissemination of results. Thanks to the integration of molecular and clinical data taking place within genomic medicine, the area has recently gained not only a fresh impulse but also a new set of complex problems that it needs to address.

Introduction

Over the last few years, the term ‘data mining’ has been increasingly used in the medical literature. In general, the term has not been anchored to any precise definition but to some sort of common understanding of its meaning: the use of (novel) methods and tools to analyze large amounts of data. Data mining has been applied with success to different fields of human endeavor, including marketing, banking, customer relationship management, engineering and various areas of science. However, its application to the analysis of medical data – despite high hopes – has until recently been relatively limited. This is particularly true of practical applications in clinical medicine which may benefit from specific data mining approaches that are able to perform predictive modeling, exploit the knowledge available in the clinical domain and explain proposed decisions once the models are used to support clinical decisions. The goal of predictive data mining in clinical medicine is to derive models that can use patient-specific information to predict the outcome of interest and to thereby support clinical decision-making. Predictive data mining methods may be applied to the construction of decision models for procedures such as prognosis, diagnosis and treatment planning, which – once evaluated and verified – may be embedded within clinical information systems.
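As a minimal sketch of what deriving a model that uses patient-specific information to predict an outcome can look like, the Python example below fits a naive Bayesian classifier, one of the method families covered by the works this review cites, to a handful of records. Everything in it is an assumption made for illustration: the two binary patient attributes (`elderly`, `abnormal_lab`), the eight records (invented, not clinical data), and the function names `fit_naive_bayes` and `predict_risk`. It is not the authors' method.

```python
from collections import Counter


def fit_naive_bayes(records):
    """Count outcomes and per-outcome feature values from (features, outcome) pairs."""
    class_counts = Counter()
    feature_counts = Counter()  # key: (outcome, feature index, feature value)
    for features, outcome in records:
        class_counts[outcome] += 1
        for index, value in enumerate(features):
            feature_counts[(outcome, index, value)] += 1
    return class_counts, feature_counts


def predict_risk(model, features):
    """Return P(outcome = 1 | features), using Laplace (add-one) smoothing."""
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    scores = {}
    for outcome in (0, 1):
        # Smoothed prior for a binary outcome.
        score = (class_counts[outcome] + 1) / (total + 2)
        for index, value in enumerate(features):
            # Smoothed likelihood of a binary feature value given the outcome.
            count = feature_counts[(outcome, index, value)]
            score *= (count + 1) / (class_counts[outcome] + 2)
        scores[outcome] = score
    return scores[1] / (scores[0] + scores[1])


# Invented toy records: ((elderly, abnormal_lab), adverse_outcome).
records = [
    ((1, 1), 1),
    ((1, 1), 1),
    ((1, 0), 1),
    ((0, 1), 1),
    ((1, 0), 0),
    ((0, 1), 0),
    ((0, 0), 0),
    ((0, 0), 0),
]

model = fit_naive_bayes(records)
print(f"elderly, abnormal lab:   {predict_risk(model, (1, 1)):.2f}")  # 0.80
print(f"not elderly, normal lab: {predict_risk(model, (0, 0)):.2f}")  # 0.20
```

Each predicted probability can be traced back by hand to the counts in the records, which illustrates why transparent models of this kind are suited to explaining a proposed decision. A real study would still need the evaluation and verification steps mentioned above before any clinical use.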

In this paper, we give a methodological review of data mining, focusing on its data analysis process and highlighting some of the most relevant issues related to its application in clinical medicine. We limit the paper's scope to predictive data mining, whose methods are methodologically mature, often readily available, and particularly suitable for the class of problems arising from clinical data analysis and decision support.

Section snippets

Background

Data mining is the process of selecting, exploring and modeling large amounts of data in order to discover unknown patterns or relationships which provide a clear and useful result to the data analyst [1]. Coined in the mid-1990s, the term data mining has today become a synonym for ‘Knowledge Discovery in Databases’ which, as proposed by Fayyad et al. [2], emphasized the data analysis process rather than the use of specific analysis methods. Data mining problems are often solved by using a

Contribution of data mining to predictive modeling in clinical medicine

Predictive models in clinical medicine are ‘… tools for helping decision making that combine two or more items of patient data to predict clinical outcomes’ [68]. Such models may be used in several clinical contexts by clinicians and may allow a prompt reaction to unfavorable situations [69]. Data mining may effectively contribute to the development of clinically useful predictive models thanks to at least three inter-related aspects: (a) a comprehensive and purposive approach to data analysis

Predictive data mining process: tasks and guidelines

Data mining is most often the application of a number of different techniques from various disciplines with the goal of discovering interesting patterns in data. Given the large variety of available techniques and the interdisciplinary fields involved, it is no surprise that data mining is often viewed as a craft that is hard to learn and even harder to master.

As we mentioned, several process models and standards have been proposed to introduce engineering principles, systemize the process and define typical

Discussion

Compared to data mining in business, marketing and the economy, medical data mining applications have several distinguishing features [104]. The most important one is that medicine is a safety-critical context [105] in which decision-making activities should always be supported by explanations. This means that the value of each datum may be higher than in other contexts: experiments can be costly due to the involvement of personnel and the use of expensive instrumentation and due to the

Conclusion

At present, many ripe predictive data mining methods have been successfully applied to a variety of practical problems in clinical medicine. As suggested by Hand [40], data mining is particularly successful where data are in abundance. For clinical medicine, this includes the analysis of clinical data warehouses, epidemiological studies and emerging studies in genomics and proteomics. Crucial to such data are those data mining approaches which allow the use of the background knowledge, discover

Acknowledgements

The authors would like to acknowledge the help given by the International Medical Informatics Association and its Working Group on Intelligent Data Analysis and Data Mining, which they are chairing. The work was supported by a Slovenian-Italian Bilateral Collaboration Project. RB is also supported by the Italian Ministry of University and Scientific Research through the PRIN Project ‘Dynamic modeling of gene and protein expression profiles: clustering techniques and regulatory networks’, and BZ

References (108)

  • E.H. Shortliffe et al.

    Computer-based consultations in clinical therapeutics: explanation and rule acquisition capabilities of the MYCIN system

    Comput. Biomed. Res.

    (1975)
  • B. Sierra et al.

    Predicting survival in malignant skin melanoma using Bayesian networks automatically induced by genetic algorithms: an empirical comparison between different approaches

    Artif. Intell. Med.

    (1998)
  • B. Zupan et al.

    Knowledge-based data analysis and interpretation

    Artif. Intell. Med.

    (2006)
  • S. Mani et al.

    Two-stage machine learning model for guideline development

    Artif. Intell. Med.

    (1999)
  • R. Kohavi et al.

    Wrappers for feature subset selection

    Artif. Intell.

    (1997)
  • B. Zupan et al.

    Machine learning for survival analysis: a case study on recurrence of prostate cancer

    Artif. Intell. Med.

    (2000)
  • P. Giudici

    Applied Data Mining: Statistical Methods for Business and Industry

    (2003)
  • U. Fayyad et al.

    Data mining and knowledge discovery in databases

    Commun. ACM

    (1996)
  • B. Zupan et al.

    Predicting patient's long-term clinical status after hip arthroplasty using hierarchical decision modelling and data mining

    Meth. Inf. Med.

    (2001)
  • J. Demsar et al.

    Orange: from experimental machine learning to interactive data mining

  • I. Kononenko

    Inductive and Bayesian learning in medical diagnosis

    Appl. Artif. Intelligen.

    (1993)
  • J. Lubsen et al.

    A practical device for the application of a diagnostic or prognostic function

    Meth. Inf. Med.

    (1978)
  • M. Mozina et al.

    Nomograms for visualization of naive Bayesian classifier

  • F.E. Harrell

    Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis

    (2001)
  • M.W. Kattan et al.

    A preoperative nomogram for disease recurrence following radical prostatectomy for prostate cancer

    J. Natl. Cancer Inst.

    (1998)
  • M. Graefen et al.

    International validation of a preoperative nomogram for prostate cancer recurrence after radical prostatectomy

    J. Clin. Oncol.

    (2002)
  • J.R. Quinlan

    C4.5: Programs for Machine Learning

    (1993)
  • L. Breiman

    Classification and Regression Trees

    (1993)
  • P. Clark et al.

    The CN2 Induction Algorithm

    Mach. Learn.

    (1989)
  • R.S. Michalski et al.

    Learning patterns in noisy data: the AQ approach

  • N. Lavrac et al.

    Intelligent data analysis for medical diagnosis: using machine learning and temporal abstraction

    AI Commun.

    (1998)
  • D.W. Hosmer et al.

    Applied Logistic Regression

    (2000)
  • T. Hastie et al.

    The Elements of Statistical Learning: Data Mining, Inference, and Prediction

    (2001)
  • G. Schwarzer et al.

    On the misuses of artificial neural networks for prognostic and diagnostic classification in oncology

    Stat. Med.

    (2000)
  • N. Cristianini et al.

    An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods

    (2000)
  • V.N. Vapnik

    Statistical Learning Theory

    (1998)
  • C. Cortes et al.

    Support-vector networks

    Mach. Learn.

    (1995)
  • P.W. Hamilton et al.

    Clinical applications of Bayesian belief networks in pathology

    Pathologica

    (1995)
  • D.J. Spiegelhalter et al.

    Sequential updating of conditional probabilities on directed graphical structures

    Networks

    (1990)
  • W.L. Buntine

    A guide to the literature on learning probabilistic networks from data

    IEEE Trans. Know. Data Eng.

    (1996)
  • G.F. Cooper et al.

    A Bayesian method for the induction of probabilistic networks from data

    Mach. Learn.

    (1992)
  • M. Ramoni et al.

    Robust learning with missing data

    Mach. Learn.

    (2001)
  • P. Sebastiani et al.

    Genetic dissection and prognostic modeling of overt stroke in sickle cell anemia

    Nat. Genet.

    (2005)
  • P. Larrañaga et al.

    Learning Bayesian networks by genetic algorithms: a case study in the prediction of survival in malignant skin melanoma

  • P. Le Phillip et al.

    Using prior knowledge to improve genetic network reconstruction from microarray data

    In Silico Biol.

    (2004)
  • P. Chapman et al.

    CRISP-DM 1.0: Step-by-Step Data Mining Guide, The CRISP-DM Consortium

    (2000)
  • G.W. Moore et al.

    Anatomic pathology data mining

  • D. Hristovski et al.

    Supporting discovery in medicine by association rule mining in Medline and UMLS

    Medinfo

    (2001)
  • A.R. Aronson

    Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program

    Proc. AMIA Symp.

    (2001)
  • D.J. Hand

    Data mining: statistics and more?

    Am. Statist.

    (1998)