Results on mining NHANES data: A case study in evidence-based medicine

https://doi.org/10.1016/j.compbiomed.2013.02.018

Abstract

The National Health and Nutrition Examination Survey (NHANES), administered annually by the National Center for Health Statistics, is designed to assess the general health and nutritional status of adults and children in the United States. Given to several thousand individuals, the survey is broad in scope, covering demographic, laboratory and examination information, as well as responses to a fairly comprehensive health questionnaire. In this paper, we adapt and extend association rule mining and clustering algorithms to extract useful knowledge regarding diabetes and high blood pressure from the 1999–2008 survey results, thus demonstrating how data mining techniques may be used to support evidence-based medicine.

Introduction

Evidence-based medicine is an attempt at enriching the decision-making process of healthcare professionals by collecting and analyzing data (e.g., clinical trials), making relevant results readily available for use in diagnoses, prescriptions and treatments. In this way, the evidence collected through systematic research by the larger medical community can be used to complement and extend an individual practitioner's clinical expertise [1].

With the digitization of medical publications, and the deployment of standards and tools for the systematic collection of healthcare patient data, greater opportunities are now available for evidence-based medicine. Over the past 20 years, for example, a number of researchers have focused their attention on literature-based discovery, i.e., the discovery of interesting medical facts from the medical literature [2], [3], [4], [5], [6]. A variety of techniques, such as simple co-occurrence counts, information retrieval measures, and association rules (see [7] for a recent and fairly comprehensive survey), have been used to generate a handful of valuable hypotheses from Medline following Swanson's ABC model of discovery [2]. Most recently, literature-based discovery has been shown to be more reliable and more timely than regulatory agencies at identifying dangerous adverse drug reactions [8]. In the past decade, with the availability of more structured data and the development of novel algorithms, other researchers have increasingly turned their attention to medical data mining, i.e., the discovery of interesting patterns from observational healthcare patient data [9], [10].

In this paper, we focus on the latter in the context of health data and statistical information collected by the National Center for Health Statistics (NCHS) with a view to improving public health. Since 1971, except for a short 4-year gap from 1994 to 1999, the NCHS has been conducting the annual National Health and Nutrition Examination Survey (NHANES) to assess the health and nutritional status of adults and children in the United States. Since 1999, the survey has been administered systematically every other year to approximately 10,000 individuals of all ages, with a response rate of about 80%. Results of the survey are released in two-year blocks, with the five blocks of data for the years 1999–2008 available electronically from the website of the Centers for Disease Control and Prevention (CDC).

Because of its coverage of a broad range of health-related issues, and its combination of self-reported questionnaire data with lab results and examinations, NHANES has been a rich source of data for the investigation of specific health questions. Most NHANES-derived findings are the result of sophisticated statistical analyses. To the best of our knowledge, the application of data mining to NHANES—and indeed in the public health domain in general—has been limited. Here, we mine the 1999–2008 NHANES data for knowledge on diabetes and high blood pressure.

The motivation for our choice is two-fold. First, it is well known that data mining is more than blindly applying algorithms to data hoping that something useful will be discovered. Successful application of data mining requires domain knowledge and a clearly articulated objective or area of interest [11]. Furthermore, the application of data mining to new domains often raises interesting issues that current techniques do not handle well, and thus provides opportunities for useful algorithmic development and extensions. Second, diabetes and high blood pressure, in addition to being specific, are two significant and increasingly concerning modern health issues. Diabetes is a disease characterized by high levels of blood glucose. It is often accompanied by other serious health complications and may lead to premature death. According to the National Diabetes Information Clearinghouse's 2011 statistics, diabetes is one of the top ten leading causes of death in the United States, and 25.8 million people (8.3%) of all ages have diabetes in the United States [12]. Likewise, high blood pressure is a significant health concern. According to the American Heart Association, about 76.4 million people in the United States age 20 or older have high blood pressure [13], and the death rate due to associated complications, such as heart disease and stroke, increased 25.2% from 1995 to 2005 [14]. Hence, finding related symptoms and quantifying their relationships to high blood pressure and diabetes are valuable efforts.

To illustrate the scope of data mining technology, we make use of several complementary approaches in our analyses. We first look at simple correlations between our selected health issues and other health conditions. We then take a more global view in which we adapt and extend the MSapriori algorithm [15] to apply association rule mining effectively to our data in order to highlight which conditions are more likely to occur with each other. Finally, we propose an original definition of distance between health conditions based on the frequency of co-occurrence of Yes values among health indicators and use it as a basis for clustering, thus bringing out further interesting relationships. In all cases, we check our findings against the medical literature. We find that most of them are supported by existing medical knowledge, which validates the data mining approach. By extension, other findings through data mining can be afforded some credibility. In particular, those rules for which we find no, or little, support in the current literature may offer possibly interesting avenues of further medical investigation. While we certainly do not make any claim of definitiveness about our results, we do highlight some interesting relationships and illustrate the value of data mining in the public health domain.
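As a minimal sketch of the last two steps, the following Python code works on toy records invented for illustration (they are not NHANES data). It computes the support and confidence of a rule, applies the multiple-minimum-supports idea of [15], in which an itemset is judged against the lowest minimum item support among its items, and evaluates one plausible co-occurrence distance. The Jaccard-style distance, the function names, and the threshold values are assumptions made for illustration; the definition of distance actually used in this paper is not reproduced here.

```python
# Toy records: each set holds the health indicators answered "Yes".
# Values are invented for illustration; they are NOT NHANES data.
records = [
    {"diabetes", "high_bp", "overweight"},
    {"high_bp", "overweight"},
    {"diabetes", "high_bp"},
    {"overweight"},
    {"diabetes", "high_bp", "overweight"},
    {"high_bp"},
]

# Hypothetical per-item minimum supports (MIS): a rarer condition
# gets a lower threshold so that rules about it are not pruned.
mis = {"diabetes": 0.3, "high_bp": 0.5, "overweight": 0.5}


def support(itemset, records):
    """Fraction of records in which every item of `itemset` is Yes."""
    itemset = set(itemset)
    return sum(itemset <= r for r in records) / len(records)


def confidence(antecedent, consequent, records):
    """Support of the whole rule divided by support of its antecedent."""
    whole = set(antecedent) | set(consequent)
    return support(whole, records) / support(antecedent, records)


def min_support(itemset, mis):
    """Multiple-minimum-supports idea: the threshold for an itemset
    is the lowest MIS value among the items it contains."""
    return min(mis[item] for item in itemset)


def is_frequent(itemset, records, mis):
    """True if the itemset meets its own item-dependent threshold."""
    return support(itemset, records) >= min_support(itemset, mis)


def cooccurrence_distance(a, b, records):
    """Assumed distance: 1 minus the Jaccard similarity of the sets of
    records where `a` is Yes and where `b` is Yes."""
    both = sum(a in r and b in r for r in records)
    either = sum(a in r or b in r for r in records)
    return 1.0 - both / either if either else 1.0


if __name__ == "__main__":
    rule = ({"diabetes"}, {"high_bp"})
    print("support:", support(rule[0] | rule[1], records))
    print("confidence:", confidence(*rule, records))
    print("frequent:", is_frequent(rule[0] | rule[1], records, mis))
    print("distance:", cooccurrence_distance("diabetes", "high_bp", records))
```

On these toy records, the rule diabetes → high_bp has support 0.5 and confidence 1.0, and the two conditions are at distance 0.4 from each other. A matrix of such pairwise distances could then be passed to any standard clustering routine.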

The paper is organized as follows. Section 2 briefly describes several previous studies of the NHANES data, highlighting the difference with our data mining approach. Section 3 provides basic statistical information about the data, and describes the methodology followed for the important preprocessing phase of the data mining process. Section 4 describes our overall experimental framework, and discusses the results and health-related knowledge uncovered through the use of various data mining strategies. Finally, Section 5 concludes the paper.

Section snippets

Related work

While medical and health informatics have been growing rapidly in the past couple of decades with work ranging from standardization (e.g., ICD-9) to electronic health records (EHR) to Web and mobile health applications (e.g., health forums, iPhone apps, blogs), the uptake of medical data mining has been somewhat slower, yet steady [9], [10]. Unsupervised approaches, as we use here, were first introduced to the healthcare field by [16], and have since been used by a number of other researchers

Data preprocessing

The NHANES data for each 2-year cycle consists of a collection of four distinct components, each addressing complementary aspects of health issues and behavior:

  • Indicators and measurements taken during physical examinations by medical professionals, such as audiometry, ophthalmology, body measurements, cardiovascular fitness, oral health, and vision.

  • Results from laboratory analyses, such as blood, urine, diabetes profile, infectious disease profile, nutritional biochemistries, miscellaneous

Experimental results

Data mining techniques include data visualization [31], data pre-processing, such as feature selection and/or extraction [32], classification [33], regression [34], association rule mining [35], and clustering [36]. These techniques, while interesting and useful in their own right, can also often complement each other.

Such is the case in the present study. The NHANES data is descriptive in nature, and thus contains no explicit target as would be the case in a typical classification task. Hence,

Discussion and conclusions

Randomized Controlled Trials (RCTs) have long been the norm in medical research. In a typical RCT, researchers focus on a single health condition wherein two or more groups of individuals are constituted such that the groups are identical in every way, except in the treatment they receive relative to the condition under study. The main advantage of RCTs, and the reason for their popularity in the medical field, is that they make it possible to establish statistical significance and causality

Conflict of interest statement

None declared.

Acknowledgments

We wish to thank Yao Huang Lin, David Wilcox, Matthew Smith, Udip Pant and Kevin Murdock for valuable discussions, assistance in running experiments and comments on the paper. We also wish to thank the organizers of the American Medical Informatics Association 2008 Data Mining Competition for bringing this data to our attention and encouraging our efforts.

References (75)

  • M.C. Ganiz, W.M. Pottenger, C.D. Janneck, Recent Advances in Literature based Discovery, Technical Report...
  • K.D. Shetty et al., Using information mining of the medical literature to improve drug safety, J. Am. Med. Inf. Assoc. (2011)
  • K.J. Cios, Medical Data Mining and Knowledge Discovery, Studies in Fuzziness and Soft Computing, vol. 60, Springer,...
  • A. Montgomery, Data mining: business hunching, not just data crunching, in: Proceedings of the Second International...
  • Centers for Disease Control and Prevention. National Diabetes Fact Sheet: National Estimates and General Information on...
  • American Heart Association. About High Blood Pressure, Online at...
  • Medical Disability Advisor. High Blood Pressure, Benign, Online at...
  • B. Liu, W. Hsu, Y. Ma, Mining association rules with multiple minimum supports, in: Proceedings of the Fifth ACM SIGKDD...
  • S. Stilou et al., Mining association rules from clinical databases: an intelligent diagnostic process in healthcare, Stud. Health Technol. Inf. (2001)
  • S. Doddi et al., Discovery of association rules in medical data, Med. Inf. Internet Med. (2001)
  • C. Creighton et al., Mining gene expression databases for association rules, Bioinformatics (2003)
  • A.M. Berger et al., Data mining as a tool for research and knowledge development in nursing, Comput. Inf. Nursing (2004)
  • T. Mikos et al., The use of data mining in the categorization of patients with azoospermia, Hormones (Athens Greece) (2005)
  • T. Imamura et al., A technique for identifying three diagnostic findings using association analysis, Med. Biol. Eng. Comput. (2007)
  • R. Bethene Ervin, Prevalence of metabolic syndrome among adults 20 years of age and over, by sex, age, race and...
  • J.D. Wright, R. Hirsch, C.-Y. Wang, One-third of US adults embraced most heart healthy behavior in 1999–2002. NCHS Data...
  • C.D. Fryar, M.C. Merino, R. Hirsch, K.S. Porter, Smoking, alcohol use, and illicit drug use reported by adolescents...
  • S.J. Ventura, Changing patterns of nonmarital childbearing in the United States. NCHS Data Brief, Number 18,...
  • M.A. McDowell, D.A. Lacher, C.M. Pfeiffer, J. Mulinare, M.F. Picciano, J.I. Rader, E.A. Yetley, J. Kennedy-Stephenson,...
  • H.Y. Lin, J. Lee, M. Smith, Dependency mining on the 2005–2006 national health and nutrition examination survey data,...
  • Z. Xing et al., Exploring disease association from the NHANES data: data mining, pattern summarization, and visual analytics, Int. J. Data Warehousing Mining (2010)
  • P. Chapman, J. Clinton, R. Kerber, T. Khabaza, T. Reinartz, C. Shearer, R. Wirth, CRISP-DM 1.0: Step-by-step Data...
  • National Center for Health Statistics. Analytic and reporting guidelines: the national health and nutrition examination...
  • J. Heer et al., A tour of the visualization zoo, Commun. ACM (2010)
  • I. Guyon et al., An introduction to variable and feature selection, J. Mach. Learn. Res. (2003)
  • S.B. Kotsiantis, Supervised machine learning: a review of classification techniques, Informatica (2007)
  • J. Fox, Applied Regression Analysis, Linear Models, and Related Methods (1997)