
Toward a Progress Indicator for Machine Learning Model Building and Data Mining Algorithm Execution: A Position Paper

Published: 21 November 2017

Abstract

For user-friendliness, many software systems offer progress indicators for long-duration tasks. A typical progress indicator continuously estimates the remaining task execution time as well as the portion of the task that has already been completed. Building a machine learning model often takes a long time, yet no existing machine learning software supplies a non-trivial progress indicator. Similarly, running a data mining algorithm often takes a long time, yet no existing data mining software provides a non-trivial progress indicator. In this article, we consider the problem of offering progress indicators for machine learning model building and data mining algorithm execution. We discuss the goals and challenges intrinsic to this problem. Then we describe an initial framework for implementing such progress indicators and two advanced potential uses of them, with the goal of inspiring future research on this topic.
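
To make concrete what the abstract means by a progress indicator, the following minimal Python sketch (not from the paper) tracks the fraction of an estimated total amount of work that has been completed and extrapolates the remaining execution time from the observed average cost per unit of work. The class name, the choice of epochs as the unit of work, and the revisable total-work estimate are illustrative assumptions, not the authors' framework.

```python
import time


class ProgressIndicator:
    """Minimal sketch of a progress indicator for an iterative task.

    The total-work estimate (e.g., the number of training epochs) is a
    caller-supplied assumption and may be revised while the task runs.
    """

    def __init__(self, estimated_total_units):
        self.estimated_total_units = estimated_total_units
        self.completed_units = 0
        self.start_time = time.monotonic()

    def update(self, units_done=1, new_total_estimate=None):
        # Record newly finished work; optionally revise the total-work estimate.
        self.completed_units += units_done
        if new_total_estimate is not None:
            self.estimated_total_units = new_total_estimate

    def fraction_done(self):
        # Portion of the task estimated to be finished, capped at 100%.
        return min(self.completed_units / self.estimated_total_units, 1.0)

    def remaining_seconds(self):
        # Extrapolate from the average observed cost per completed work unit.
        if self.completed_units == 0:
            return float("inf")
        elapsed = time.monotonic() - self.start_time
        per_unit = elapsed / self.completed_units
        remaining_units = max(self.estimated_total_units - self.completed_units, 0)
        return per_unit * remaining_units


# Hypothetical usage inside a model-building loop, with epochs as the work unit.
if __name__ == "__main__":
    pi = ProgressIndicator(estimated_total_units=100)
    for epoch in range(100):
        time.sleep(0.01)  # stand-in for one epoch of model training
        pi.update()
        print(f"{pi.fraction_done():.0%} done, "
              f"~{pi.remaining_seconds():.1f}s remaining", end="\r")
```

A real indicator for model building would replace the fixed epoch count with a workload estimate that is refined as the run progresses, which is exactly the kind of estimation problem the article discusses.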

