A Bayesian perspective of statistical machine learning for big data

  • Original paper
  • Computational Statistics

Abstract

Statistical Machine Learning (SML) refers to a body of algorithms and methods by which computers discover important features of input data sets, which are often very large. This task of discovering features from data is essentially what the keyword ‘learning’ means in SML. Theoretical justifications for the effectiveness of SML algorithms are underpinned by sound principles from different disciplines, such as Computer Science and Statistics. The theoretical underpinnings justified by statistical inference methods are together termed statistical learning theory. This paper reviews SML from a Bayesian decision-theoretic point of view, arguing that many SML techniques are closely connected to making inference under the so-called Bayesian paradigm. We discuss many important SML techniques, such as supervised and unsupervised learning, deep learning, online learning and Gaussian processes, especially in the context of the very large data sets where they are often employed. We present a dictionary that maps the key concepts of SML between Computer Science and Statistics. We illustrate the SML techniques with three moderately large data sets and discuss many practical implementation issues. The review is thus targeted especially at statisticians and computer scientists who aspire to understand and apply SML to moderately large to big data sets.

Acknowledgements

Sourish Das’s research has been supported by an Infosys Foundation Grant and a TATA Trust Grant to CMI and also by a UK Government funded Commonwealth-Rutherford Scholarship (Grant No. RF 2017-123).

Author information

Corresponding author

Correspondence to Sourish Das.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Sambasivan, R., Das, S. & Sahu, S.K. A Bayesian perspective of statistical machine learning for big data. Comput Stat 35, 893–930 (2020). https://doi.org/10.1007/s00180-020-00970-8

  • DOI: https://doi.org/10.1007/s00180-020-00970-8
