DOI: 10.1145/3447545.3451185
Research article · Open Access

GradeML: Towards Holistic Performance Analysis for Machine Learning Workflows

Published: 19 April 2021

ABSTRACT

Today, machine learning (ML) workloads are nearly ubiquitous. Over the past decade, much effort has been put into making ML model-training fast and efficient, e.g., by proposing new ML frameworks (such as TensorFlow, PyTorch), leveraging hardware support (TPUs, GPUs, FPGAs), and implementing new execution models (pipelines, distributed training). Matching this trend, considerable effort has also been put into performance analysis tools focusing on ML model-training. However, as we identify in this work, ML model training rarely happens in isolation and is instead one step in a larger ML workflow. Therefore, it is surprising that there exists no performance analysis tool that covers the entire life-cycle of ML workflows. Addressing this large conceptual gap, we envision in this work a holistic performance analysis tool for ML workflows. We analyze the state-of-practice and the state-of-the-art, presenting quantitative evidence about the performance of existing performance tools. We formulate our vision for holistic performance analysis of ML workflows along four design pillars: a unified execution model, lightweight collection of performance data, efficient data aggregation and presentation, and close integration in ML systems. Finally, we propose first steps towards implementing our vision as GradeML, a holistic performance analysis tool for ML workflows. Our preliminary work and experiments are open source at https://github.com/atlarge-research/grademl.


Published in

ICPE '21: Companion of the ACM/SPEC International Conference on Performance Engineering
April 2021, 198 pages
ISBN: 9781450383318
DOI: 10.1145/3447545

Copyright © 2021 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall acceptance rate: 252 of 851 submissions, 30%
