Random Forest Algorithm for Prediction of HIV Drug Resistance

Raposo, Letícia M.; Rosa, Paulo Tadeu C. R.; Nobre, Flavio F.

doi:10.1007/978-3-030-38021-2_6

Part of the book series: STEAM-H: Science, Technology, Engineering, Agriculture, Mathematics & Health (STEAM)

Abstract

The random forest algorithm is a popular choice for genomic data analysis and bioinformatics research. Its fundamental idea is to combine many decision trees into a single model, using the random subspace method to select predictor variables. It is a nonparametric algorithm that handles both regression and classification problems and shows good predictive performance on many types of data. This chapter describes the general characteristics of the random forest algorithm and demonstrates, through a worked application, how it can be used to predict HIV-1 drug resistance. The random forest was compared with two other models, logistic regression and a classification tree; its results showed lower variability, indicating a more stable classifier.
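
The core idea in the abstract, many trees trained on resampled data with a random subset of predictors and combined by majority vote, can be illustrated with a minimal sketch. This is not the chapter's implementation or data: the binary "mutation present/absent" matrix, the labels, and all function names (`fit_stump`, `fit_forest`, `predict`) are hypothetical, and each tree is reduced to a single split (a stump) to keep the example short. With one split per tree, drawing the predictor subset once per tree is equivalent to drawing it at the split.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 for an empty list)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def majority(labels, default=0):
    """Most common label, or `default` when the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default

def fit_stump(X, y, feature_ids):
    """Choose, among feature_ids, the binary feature with the lowest weighted Gini."""
    fallback = majority(y)
    best = None
    for j in feature_ids:
        left = [yi for xi, yi in zip(X, y) if xi[j] == 0]
        right = [yi for xi, yi in zip(X, y) if xi[j] == 1]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if best is None or score < best[0]:
            best = (score, j, majority(left, fallback), majority(right, fallback))
    _, j, pred0, pred1 = best
    return j, pred0, pred1  # split feature, label if absent, label if present

def fit_forest(X, y, n_trees=25, mtry=2, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and mtry random features."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]  # bootstrap: sample rows with replacement
        feats = rng.sample(range(p), mtry)           # random subset of predictors
        forest.append(fit_stump([X[i] for i in rows], [y[i] for i in rows], feats))
    return forest

def predict(forest, x):
    """Majority vote over the predictions of all stumps."""
    return majority([pred1 if x[j] == 1 else pred0 for j, pred0, pred1 in forest])

# Hypothetical toy data: rows are sequences, columns are mutation indicators,
# y = 1 means "resistant". Column 0 is constructed to match the label exactly.
X = [[1, 0, 1, 0], [1, 1, 0, 0], [1, 0, 0, 1], [1, 1, 1, 1],
     [0, 0, 1, 0], [0, 1, 0, 1], [0, 1, 1, 0], [0, 0, 0, 1]]
y = [1, 1, 1, 1, 0, 0, 0, 0]

forest = fit_forest(X, y)
print(predict(forest, [1, 0, 0, 0]))
```

Fixing `seed` makes the resampling reproducible. In practice one would grow full trees and use an established implementation rather than this sketch; the stump version only makes the bootstrap, random predictor subset, and voting steps visible.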

Author information

Authors and Affiliations

Programa de Engenharia Biomédica, Universidade Federal do Rio de Janeiro, Rio de Janeiro, Brazil
Letícia M. Raposo, Paulo Tadeu C. R. Rosa & Flavio F. Nobre

Corresponding author

Correspondence to Letícia M. Raposo.

Editor information

Editors and Affiliations

Electrical Engineering Department, Universidad Autónoma Metropolitana-Iztapalapa, Mexico City, Mexico
Martha Refugio Ortiz-Posadas

About this chapter

Cite this chapter

Raposo, L.M., Rosa, P.T.C.R., Nobre, F.F. (2020). Random Forest Algorithm for Prediction of HIV Drug Resistance. In: Ortiz-Posadas, M. (eds) Pattern Recognition Techniques Applied to Biomedical Problems. STEAM-H: Science, Technology, Engineering, Agriculture, Mathematics & Health. Springer, Cham. https://doi.org/10.1007/978-3-030-38021-2_6

DOI: https://doi.org/10.1007/978-3-030-38021-2_6
Published: 01 March 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-38020-5
Online ISBN: 978-3-030-38021-2
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
