Abstract
For user-friendliness, many software systems offer progress indicators for long-duration tasks. A typical progress indicator continuously estimates the remaining task execution time as well as the portion of the task that has been finished. Building a machine learning model often takes a long time, but no existing machine learning software supplies a non-trivial progress indicator. Similarly, running a data mining algorithm often takes a long time, but no existing data mining software provides a non-trivial progress indicator. In this article, we consider the problem of offering progress indicators for machine learning model building and data mining algorithm execution. We discuss the goals and challenges intrinsic to this problem. Then we describe an initial framework for implementing such progress indicators and two advanced potential uses of them, with the goal of inspiring future research on this topic.
Index Terms
- Toward a Progress Indicator for Machine Learning Model Building and Data Mining Algorithm Execution: A Position Paper