Abstract
A wealth of biotechnologies (e.g. sequencing, proteomics, lipidomics) can now generate a broad range of data types from biological samples. However, the knowledge gained from such data is constrained by the limitations of the analytics techniques applied to it. State-of-the-art machine learning algorithms can capture complex patterns with high predictive capacity, but it is often very difficult, if not impossible, to extract human-understandable knowledge from these patterns. In recent years, evolutionary machine learning techniques have proved to be competent methods for biological and biomedical data analytics. They can generate interpretable prediction models and, beyond prediction, extract useful knowledge in the form of biomarkers or biological networks.
This paper thoroughly characterises the impact that a core component of the evolutionary machine learning process, its knowledge representation, has on the extraction of biologically useful knowledge from transcriptomics datasets. Using FuNeL, an evolutionary machine learning-based network inference method, we evaluate several variants of rule knowledge representations on a range of transcriptomics datasets to quantify the volume and complementarity of the knowledge that each can extract. Overall, we show that knowledge representations, often considered a minor detail, greatly affect the downstream biological knowledge extraction process.
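As a rough illustration of what quantifying "volume and complementarity" could look like, the sketch below compares sets of genes extracted under different knowledge representations. It is not the paper's method: the function names (`jaccard`, `summarise`), the use of set size as volume, and the use of Jaccard similarity and representation-specific genes as complementarity measures are all assumptions made here, and the gene identifiers are invented.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; defined as 1.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def summarise(extracted: dict) -> dict:
    """Summarise gene sets keyed by (hypothetical) representation name."""
    # Volume: how many genes each representation extracted.
    volume = {name: len(genes) for name, genes in extracted.items()}
    # Complementarity, view 1: genes found by exactly one representation.
    unique = {
        name: genes - set().union(*(g for n, g in extracted.items() if n != name))
        for name, genes in extracted.items()
    }
    # Complementarity, view 2: pairwise overlap (low overlap = more complementary).
    overlap = {
        (a, b): jaccard(extracted[a], extracted[b])
        for a, b in combinations(sorted(extracted), 2)
    }
    return {"volume": volume, "unique": unique, "overlap": overlap}


# Hypothetical gene sets, invented for illustration only.
example = {
    "representation_A": {"g1", "g2", "g3"},
    "representation_B": {"g2", "g3", "g4", "g5"},
}
report = summarise(example)
```

A low pairwise overlap together with large representation-specific sets would suggest, under these assumed measures, that the representations capture different parts of the underlying biology.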
Acknowledgments
This work was supported by the Engineering and Physical Sciences Research Council [EP/N031962/1]. We are grateful to the School of Computing Science of Newcastle University for access to its High Performance Computing Cluster. We thank the anonymous reviewers for their valuable feedback.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Baron, S., Lazzarini, N., Bacardit, J. (2017). Characterising the Influence of Rule-Based Knowledge Representations in Biological Knowledge Extraction from Transcriptomics Data. In: Squillero, G., Sim, K. (eds) Applications of Evolutionary Computation. EvoApplications 2017. Lecture Notes in Computer Science, vol 10199. Springer, Cham. https://doi.org/10.1007/978-3-319-55849-3_9
DOI: https://doi.org/10.1007/978-3-319-55849-3_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-55848-6
Online ISBN: 978-3-319-55849-3
eBook Packages: Computer Science, Computer Science (R0)