
Toward a Progress Indicator for Machine Learning Model Building and Data Mining Algorithm Execution: A Position Paper

Published: 21 November 2017

Abstract

For user-friendliness, many software systems offer progress indicators for long-duration tasks. A typical progress indicator continuously estimates the remaining task execution time as well as the portion of the task that has already been completed. Building a machine learning model often takes a long time, yet no existing machine learning software supplies a non-trivial progress indicator. Similarly, running a data mining algorithm often takes a long time, yet no existing data mining software provides a non-trivial progress indicator. In this article, we consider the problem of offering progress indicators for machine learning model building and data mining algorithm execution. We discuss the goals and challenges intrinsic to this problem. Then we describe an initial framework for implementing such progress indicators and two advanced potential uses of them, with the goal of inspiring future research on this topic.
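
To make concrete what the abstract means by a progress indicator, the following minimal Python sketch (not from the paper) tracks the fraction of an estimated total amount of work that has been completed and extrapolates the remaining execution time from the observed average cost per unit of work. The class name, the choice of epochs as the unit of work, and the revisable total-work estimate are illustrative assumptions, not the authors' framework.

```python
import time


class ProgressIndicator:
    """Minimal sketch of a progress indicator for an iterative task.

    The total-work estimate (e.g., the number of training epochs) is a
    caller-supplied assumption and may be revised while the task runs.
    """

    def __init__(self, estimated_total_units):
        self.estimated_total_units = estimated_total_units
        self.completed_units = 0
        self.start_time = time.monotonic()

    def update(self, units_done=1, new_total_estimate=None):
        # Record newly finished work; optionally revise the total-work estimate.
        self.completed_units += units_done
        if new_total_estimate is not None:
            self.estimated_total_units = new_total_estimate

    def fraction_done(self):
        # Portion of the task estimated to be finished, capped at 100%.
        return min(self.completed_units / self.estimated_total_units, 1.0)

    def remaining_seconds(self):
        # Extrapolate from the average observed cost per completed work unit.
        if self.completed_units == 0:
            return float("inf")
        elapsed = time.monotonic() - self.start_time
        per_unit = elapsed / self.completed_units
        remaining_units = max(self.estimated_total_units - self.completed_units, 0)
        return per_unit * remaining_units


# Hypothetical usage inside a model-building loop, with epochs as the work unit.
if __name__ == "__main__":
    pi = ProgressIndicator(estimated_total_units=100)
    for epoch in range(100):
        time.sleep(0.01)  # stand-in for one epoch of model training
        pi.update()
        print(f"{pi.fraction_done():.0%} done, "
              f"~{pi.remaining_seconds():.1f}s remaining", end="\r")
```

A real indicator for model building would replace the fixed epoch count with a workload estimate that is refined as the run progresses, which is exactly the kind of estimation problem the article discusses.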

