Prediction of Diabetes by Employing a New Data Mining Approach Which Balances Fitting and Generalization

  • Chapter
Computer and Information Science

Part of the book series: Studies in Computational Intelligence ((SCI,volume 131))

Summary

The Pima Indian diabetes (PID) dataset [1], originally donated by Vincent Sigillito of the Applied Physics Laboratory at Johns Hopkins University, is one of the best-known datasets for testing classification algorithms. It consists of records describing 768 female patients of Pima Indian heritage, all at least 21 years old and living near Phoenix, Arizona, USA. The problem is to predict whether a new patient would test positive for diabetes. However, the classification accuracy that current algorithms achieve on this dataset is often coincidental. The root of this problem is the overfitting and overgeneralization behavior of a classification algorithm when it processes a dataset. Although this issue is of fundamental importance in data mining, it has not been studied from a comprehensive point of view. This paper therefore describes a new approach, the Homogeneity-Based Algorithm (HBA), developed by Pham and Triantaphyllou [2, 3], to optimally control the overfitting and overgeneralization behavior of classification on this dataset. The HBA is used in conjunction with traditional classification approaches, such as Support Vector Machines (SVMs), Artificial Neural Networks (ANNs), or Decision Trees (DTs), to enhance their classification accuracy. Computational results seem to indicate that the proposed approach significantly outperforms current approaches.
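The overfitting symptom mentioned above can be made concrete with a small sketch. This is not the HBA and it does not use the PID dataset: it assumes synthetic two-feature data with noisy labels and a plain k-nearest-neighbour classifier, and every name in it (`make_data`, `knn_predict`, `accuracy`) is hypothetical. It only shows how a gap between training and held-out accuracy might be measured.

```python
import random

def make_data(n, noise, rng):
    """Synthetic 2-feature binary data (an assumption, not the PID records):
    the label is 1 when x0 + x1 > 1, and a fraction `noise` of labels is flipped."""
    data = []
    for _ in range(n):
        x = (rng.random(), rng.random())
        y = 1 if x[0] + x[1] > 1.0 else 0
        if rng.random() < noise:  # flip some labels to mimic noisy records
            y = 1 - y
        data.append((x, y))
    return data

def knn_predict(train, x, k):
    """Majority vote among the k nearest training points (squared Euclidean distance)."""
    nearest = sorted(
        train,
        key=lambda p: (p[0][0] - x[0]) ** 2 + (p[0][1] - x[1]) ** 2,
    )[:k]
    votes = sum(y for _, y in nearest)
    return 1 if 2 * votes > k else 0

def accuracy(train, data, k):
    """Fraction of points in `data` that a k-NN model built on `train` labels correctly."""
    return sum(knn_predict(train, x, k) == y for x, y in data) / len(data)

rng = random.Random(0)  # fixed seed so the run is repeatable
train = make_data(200, 0.2, rng)
test = make_data(200, 0.2, rng)

# k = 1 memorises the training set; a larger k smooths over individual records.
for k in (1, 15):
    train_acc = accuracy(train, train, k)
    test_acc = accuracy(train, test, k)
    print(f"k={k:2d}  train={train_acc:.3f}  test={test_acc:.3f}  gap={train_acc - test_acc:.3f}")
```

With k = 1 each training point is its own nearest neighbour, so the model reproduces every training label, including the flipped ones. The difference between training and held-out accuracy is one simple indicator of overfitting. A very large k would err in the opposite direction, towards overgeneralization.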

References

  1. Asuncion, A., Newman, D.J.: UCI-Machine Learning Repository. School of Information and Computer Sciences. University of California, Irvine, California, USA (2007)

  2. Pham, H.N.A., Triantaphyllou, E.: The Impact of Overfitting and Overgeneralization on the Classification Accuracy in Data Mining. In: Maimon, O., Rokach, L. (eds.) Soft Computing for Knowledge Discovery and Data Mining, Part 4, ch. 5, pp. 391–431. Springer, Heidelberg (2007)

  3. Pham, H.N.A., Triantaphyllou, E.: An Optimization Approach for Improving Accuracy by Balancing Overfitting and Overgeneralization in Data Mining (January 2008) (submitted for publication)

  4. American Diabetes Association (2007), http://www.diabetes.org/home.jsp

  5. World Health Organization, Diabetes Mellitus: Report of a WHO Study Group. Geneva: WHO, Technical Report Series 727 (1985)

  6. Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., Johannes, R.S.: Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In: Proceedings of 12th Symposium on Computer Applications and Medical Care, Los Angeles, California, USA, pp. 261–265 (1988)

  7. Jankowski, N., Kadirkamanathan, V.: Statistical control of RBF-like networks for classification. In: Gerstner, W., Hasler, M., Germond, A., Nicoud, J.-D. (eds.) ICANN 1997. LNCS, vol. 1327, pp. 385–390. Springer, Heidelberg (1997)

  8. Au, W.H., Chan, K.C.C.: Classification with degree of membership: A fuzzy approach. In: Proceedings of the 1st IEEE Int’l Conference on Data Mining, San Jose, California, USA, pp. 35–42 (2001)

  9. Rutkowski, L., Cpalka, K.: Flexible neuro-fuzzy systems. IEEE Transactions on Neural Networks 14, 554–574 (2003)

  10. Davis IV, W.L.: Enhancing Pattern Classification with Relational Fuzzy Neural Networks and Square BK-Products. PhD Dissertation in Computer Science, pp. 71–74 (2006)

  11. Michie, D., Spiegelhalter, D.J., Taylor, C.C.: Machine Learning, Neural and Statistical Classification, ch. 9. Series Artificial Intelligence, pp. 157–160. Prentice Hall, Englewood Cliffs (1994)

  12. Duda, R.O., Hart, P.E.: Pattern Classification and Scene Analysis, pp. 56–64. Wiley Publisher, Chichester (1973)

  13. Artificial Neural Network Toolbox 6.0 and Statistics Toolbox 6.0, Matlab Version 7.0, http://www.mathworks.com/products/

Editor information

Roger Lee, Haeng-Kon Kim

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Pham, H.N.A., Triantaphyllou, E. (2008). Prediction of Diabetes by Employing a New Data Mining Approach Which Balances Fitting and Generalization. In: Lee, R., Kim, HK. (eds) Computer and Information Science. Studies in Computational Intelligence, vol 131. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-79187-4_2

  • DOI: https://doi.org/10.1007/978-3-540-79187-4_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-79186-7

  • Online ISBN: 978-3-540-79187-4

  • eBook Packages: Engineering, Engineering (R0)
