Abstract
In bioinformatics, machine learning for data mining aims to extract new knowledge that supports an improved and more effective diagnosis process for patients. In this paper, we introduce an adaptive ensemble learning method for classifying high-dimensional multi-class imbalanced genomic data. The aim is to design and develop an optimal ensemble method for information discovery on genomic data that improves the prediction accuracy of DNA variant classification. The proposed method is based on an ensemble of decision trees, data pre-processing, feature selection and feature grouping. It converts an imbalanced genomic dataset into multiple balanced datasets and then builds a number of decision trees on these datasets using specific feature groups. The outputs of these trees are combined by majority voting to classify new instances. In this empirical study, ensemble predictive modelling techniques such as Random Forest, Boosting and Bagging were compared with the proposed ensemble method. The experimental results on genomic data (148 Exome datasets) of Brugada syndrome from the Centre of Medical Genetics, VUB UZ Brussel show that the proposed method is usually superior to the conventional ensemble learning algorithms when classifying high-dimensional multi-class imbalanced genomic data.
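The pipeline described in the abstract (balanced subsets, trees on feature groups, majority voting) can be illustrated with a minimal sketch. The abstract does not specify how the balanced datasets or feature groups are formed, so everything here is an assumption: random undersampling to the minority-class size, a round-robin feature split, one small tree per (subset, group) pair, and a misclassification-count split criterion. All function names are hypothetical, and this is not the authors' implementation.

```python
import random
from collections import Counter, defaultdict

def balanced_subsets(y, n_subsets, rng):
    """Undersample every class to the minority-class size, n_subsets times."""
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    k = min(len(rows) for rows in by_class.values())
    return [[i for rows in by_class.values() for i in rng.sample(rows, k)]
            for _ in range(n_subsets)]

def feature_groups(n_features, n_groups):
    """Round-robin split of feature indices (placeholder for a real grouping)."""
    return [list(range(g, n_features, n_groups)) for g in range(n_groups)]

def fit_tree(X, y, rows, features, depth=3):
    """Small depth-limited tree; each split minimises the misclassification count."""
    counts = Counter(y[i] for i in rows)
    majority = counts.most_common(1)[0][0]
    if depth == 0 or len(counts) == 1:
        return majority  # leaf: a class label
    best = None
    for f in features:
        values = sorted({X[i][f] for i in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [i for i in rows if X[i][f] <= t]
            right = [i for i in rows if X[i][f] > t]
            if not left or not right:
                continue
            hits = sum(Counter(y[i] for i in part).most_common(1)[0][1]
                       for part in (left, right))
            if best is None or len(rows) - hits < best[0]:
                best = (len(rows) - hits, f, t, left, right)
    if best is None:  # every feature in the group is constant on these rows
        return majority
    _, f, t, left, right = best
    return (f, t, fit_tree(X, y, left, features, depth - 1),
            fit_tree(X, y, right, features, depth - 1))

def predict_tree(node, x):
    """Walk from the root to a leaf; internal nodes are 4-tuples."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if x[f] <= t else right
    return node

def fit_ensemble(X, y, n_subsets=5, n_groups=2, seed=0):
    """Train one tree per (balanced subset, feature group) pair."""
    rng = random.Random(seed)
    groups = feature_groups(len(X[0]), n_groups)
    return [fit_tree(X, y, rows, feats)
            for rows in balanced_subsets(y, n_subsets, rng)
            for feats in groups]

def predict(trees, x):
    """Combine the tree outputs by majority vote."""
    votes = Counter(predict_tree(tree, x) for tree in trees)
    return votes.most_common(1)[0][0]

# Toy imbalanced data (12 / 4 / 3 instances per class), invented for illustration.
X = ([[0.1 * j] * 4 for j in range(12)]
     + [[5 + 0.1 * j] * 4 for j in range(4)]
     + [[10 + 0.1 * j] * 4 for j in range(3)])
y = ["A"] * 12 + ["B"] * 4 + ["C"] * 3

trees = fit_ensemble(X, y)
predictions = [predict(trees, [v] * 4) for v in (0.5, 5.1, 10.5)]
print(len(trees), predictions)  # prints: 10 ['A', 'B', 'C']
```

Because every tree is trained on a class-balanced subset, the majority class cannot dominate any single tree; the vote then averages over both the sampling and the feature grouping. In practice a library tree learner and an informed feature grouping would replace the placeholders used here.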
Acknowledgment
We appreciate the support for this research received from the BRiDGEIris (BRussels big Data platform for sharing and discovery in clinical GEnomics) project, which is hosted by IB\(^{2}\) (Interuniversity Institute of Bioinformatics in Brussels) and funded by INNOVIRIS (Brussels Institute for Research and Innovation). We also acknowledge the FWO research project G004414N “Machine Learning for Data Mining Applications in Cancer Genomics”.
Copyright information
© 2018 Springer International Publishing AG
About this paper
Cite this paper
Farid, D.M., Nowe, A., Manderick, B. (2018). Ensemble of Trees for Classifying High-Dimensional Imbalanced Genomic Data. In: Bi, Y., Kapoor, S., Bhatia, R. (eds) Proceedings of SAI Intelligent Systems Conference (IntelliSys) 2016. IntelliSys 2016. Lecture Notes in Networks and Systems, vol 15. Springer, Cham. https://doi.org/10.1007/978-3-319-56994-9_12
DOI: https://doi.org/10.1007/978-3-319-56994-9_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-56993-2
Online ISBN: 978-3-319-56994-9
eBook Packages: Engineering, Engineering (R0)