Ensemble of Trees for Classifying High-Dimensional Imbalanced Genomic Data

Farid, Dewan Md.; Nowe, Ann; Manderick, Bernard

doi:10.1007/978-3-319-56994-9_12

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 15)

Included in the following conference series:

Proceedings of SAI Intelligent Systems Conference

Abstract

Machine learning for data mining in bioinformatics aims to extract new knowledge that supports an improved and more effective diagnosis process for patients. In this paper, we introduce an adaptive ensemble learning method for classifying high-dimensional, multi-class, imbalanced genomic data. The aim is to design and develop an optimal ensemble method for information discovery on genomic data that improves the prediction accuracy of DNA variant classification. The proposed method is based on an ensemble of decision trees, data pre-processing, feature selection and feature grouping. It converts an imbalanced genomic dataset into multiple balanced ones and then builds a number of decision trees on these datasets, each with a specific feature group. The outputs of these trees are combined by majority voting to classify new instances. In this empirical study, ensemble predictive modelling techniques such as Random Forest, Boosting and Bagging were compared with the proposed ensemble method. The experimental results on genomic data (148 exome datasets) of Brugada syndrome from the Centre of Medical Genetics, VUB UZ Brussel, show that the proposed method is usually superior to the conventional ensemble learning algorithms when classifying high-dimensional, multi-class, imbalanced genomic data.
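
The pipeline summarised in the abstract (balanced subsets, one tree per feature group, majority voting) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not say how subsets are balanced or how features are grouped, so the sketch assumes random undersampling to the smallest class size and random disjoint feature groups. All function names, the small Gini-based tree and the synthetic toy data are hypothetical and only serve the illustration.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(rows, labels, features, depth=0, max_depth=3):
    """Grow a small tree that may only split on the given feature indices."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == max_depth or len(set(labels)) == 1:
        return majority  # leaf: predicted class label
    best = None  # (weighted impurity, feature index, threshold)
    for f in features:
        vals = sorted({r[f] for r in rows})
        for a, b in zip(vals, vals[1:]):
            t = (a + b) / 2  # midpoint between neighbouring values
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    if best is None:
        return majority
    _, f, t = best
    li = [i for i, r in enumerate(rows) if r[f] <= t]
    ri = [i for i, r in enumerate(rows) if r[f] > t]
    return (f, t,
            build_tree([rows[i] for i in li], [labels[i] for i in li],
                       features, depth + 1, max_depth),
            build_tree([rows[i] for i in ri], [labels[i] for i in ri],
                       features, depth + 1, max_depth))

def predict_tree(node, row):
    """Walk from the root to a leaf; internal nodes are 4-tuples."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node

def balanced_subset(rows, labels, rng):
    """Assumption: balance by undersampling every class to the smallest class."""
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    n_min = min(len(ix) for ix in by_class.values())
    picked = [i for ix in by_class.values() for i in rng.sample(ix, n_min)]
    return [rows[i] for i in picked], [labels[i] for i in picked]

def feature_groups(n_features, n_groups, rng):
    """Assumption: features are split into random disjoint groups."""
    idx = list(range(n_features))
    rng.shuffle(idx)
    return [idx[g::n_groups] for g in range(n_groups)]

def fit_ensemble(rows, labels, n_groups=3, seed=0):
    """Train one tree per feature group, each on its own balanced subset."""
    rng = random.Random(seed)
    trees = []
    for group in feature_groups(len(rows[0]), n_groups, rng):
        sub_rows, sub_labels = balanced_subset(rows, labels, rng)
        trees.append(build_tree(sub_rows, sub_labels, group))
    return trees

def predict_ensemble(trees, row):
    """Combine the tree outputs by majority voting."""
    votes = Counter(predict_tree(t, row) for t in trees)
    return votes.most_common(1)[0][0]

def toy_data(seed=1):
    """Synthetic imbalanced 3-class data (40/10/5 rows, 6 features)."""
    rng = random.Random(seed)
    rows, labels = [], []
    for label, centre, n in [("A", 0.0, 40), ("B", 10.0, 10), ("C", 20.0, 5)]:
        for _ in range(n):
            rows.append([centre + rng.uniform(-1, 1) for _ in range(6)])
            labels.append(label)
    return rows, labels

if __name__ == "__main__":
    rows, labels = toy_data()
    trees = fit_ensemble(rows, labels)
    for probe in ([0.0] * 6, [10.0] * 6, [20.0] * 6):
        print(predict_ensemble(trees, probe))
```

In this sketch every tree sees the same number of rows per class, so the majority class cannot dominate any single tree, and each tree is restricted to its own feature group, so the votes come from different views of the data. The toy classes are deliberately well separated; real exome data would need the pre-processing and feature selection steps that the abstract mentions but does not detail.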

Acknowledgment

We appreciate the support for this research received from the BRiDGEIris (BRussels big Data platform for sharing and discovery in clinical GEnomics) project, which is hosted by IB² (Interuniversity Institute of Bioinformatics in Brussels) and funded by INNOVIRIS (Brussels Institute for Research and Innovation). This research was also supported by FWO research project G004414N, “Machine Learning for Data Mining Applications in Cancer Genomics”.

Author information

Authors and Affiliations

Computational Modeling Lab, Department of Computer Science, Vrije Universiteit Brussel, Pleinlaan 2, 1050, Brussels, Belgium
Dewan Md. Farid, Ann Nowe & Bernard Manderick

Corresponding author

Correspondence to Dewan Md. Farid.

Editor information

Editors and Affiliations

Faculty of Computing and Engineering, School of Computing and Mathematics, University of Ulster at Jordanstown, Newtownabbey, United Kingdom
Yaxin Bi
The Science and Information (SAI) Organization, Bradford, West Yorkshire, United Kingdom
Supriya Kapoor
The Science and Information (SAI) Organization, Bradford, West Yorkshire, United Kingdom
Rahul Bhatia

About this paper

Cite this paper

Farid, D.M., Nowe, A., Manderick, B. (2018). Ensemble of Trees for Classifying High-Dimensional Imbalanced Genomic Data. In: Bi, Y., Kapoor, S., Bhatia, R. (eds) Proceedings of SAI Intelligent Systems Conference (IntelliSys) 2016. IntelliSys 2016. Lecture Notes in Networks and Systems, vol 15. Springer, Cham. https://doi.org/10.1007/978-3-319-56994-9_12

DOI: https://doi.org/10.1007/978-3-319-56994-9_12
Published: 20 August 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-56993-2
Online ISBN: 978-3-319-56994-9
eBook Packages: Engineering, Engineering (R0)