How Many Trees in a Random Forest?

Conference paper
Machine Learning and Data Mining in Pattern Recognition (MLDM 2012)

Abstract

Random Forest is a computationally efficient technique that can operate quickly over large datasets. It has been used in many recent research projects and real-world applications in diverse domains. However, the associated literature provides almost no guidance on how many trees should be used to compose a Random Forest. The research reported here analyzes whether there is an optimal number of trees within a Random Forest, i.e., a threshold beyond which increasing the number of trees brings no significant performance gain and only increases the computational cost. Our main conclusions are: growing the number of trees does not always make the forest perform significantly better than smaller forests (fewer trees), and doubling the number of trees is not worthwhile. It is also possible to state that there is a threshold beyond which there is no significant gain, unless a huge computational environment is available. In addition, an experimental relationship was found for the AUC gain obtained when doubling the number of trees in any forest. Furthermore, as the number of trees grows, the full set of attributes tends to be used within a Random Forest, which may be undesirable in the biomedical domain. Additionally, the density-based dataset metrics proposed here probably capture some aspects of the VC dimension of decision trees; low-density datasets may require large-capacity machines, whereas the opposite also seems to be true.
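The paper itself does not include code, but the experiment the abstract describes is easy to reproduce in outline. Below is a minimal sketch, assuming scikit-learn and an illustrative synthetic dataset (both assumptions, not taken from the paper): it trains Random Forests whose tree counts double at each step and reports the test AUC, so the point where doubling stops paying off can be observed.

```python
# Minimal sketch (not from the paper): track test AUC as the number of
# trees in a Random Forest doubles, to see where the gain levels off.
# scikit-learn and the synthetic dataset are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Illustrative binary-classification dataset.
X, y = make_classification(n_samples=2000, n_features=50,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

prev_auc = None
for n_trees in (2, 4, 8, 16, 32, 64, 128, 256, 512):
    rf = RandomForestClassifier(n_estimators=n_trees,
                                random_state=0, n_jobs=-1)
    rf.fit(X_train, y_train)
    # AUC on the held-out set, using the probability of the positive class.
    auc = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])
    gain = "" if prev_auc is None else f"  gain vs. half: {auc - prev_auc:+.4f}"
    print(f"{n_trees:4d} trees  AUC: {auc:.4f}{gain}")
    prev_auc = auc
```

Plotting AUC against the number of trees on a logarithmic scale makes the plateau the authors report easy to see; in practice the gain from each doubling tends to shrink quickly.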


Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Oshiro, T.M., Perez, P.S., Baranauskas, J.A. (2012). How Many Trees in a Random Forest? In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition. MLDM 2012. Lecture Notes in Computer Science, vol. 7376. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31537-4_13

  • DOI: https://doi.org/10.1007/978-3-642-31537-4_13

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-31536-7

  • Online ISBN: 978-3-642-31537-4

  • eBook Packages: Computer Science, Computer Science (R0)
