ABSTRACT
Today, machine learning (ML) workloads are nearly ubiquitous. Over the past decade, much effort has been put into making ML model training fast and efficient, e.g., by developing new ML frameworks (such as TensorFlow and PyTorch), leveraging hardware support (TPUs, GPUs, FPGAs), and implementing new execution models (pipelined and distributed training). Matching this trend, considerable effort has also gone into performance-analysis tools focused on ML model training. However, as we identify in this work, ML model training rarely happens in isolation and is instead one step in a larger ML workflow. It is therefore surprising that no performance-analysis tool covers the entire lifecycle of ML workflows. Addressing this large conceptual gap, we envision in this work a holistic performance-analysis tool for ML workflows. We analyze the state-of-practice and the state-of-the-art, presenting quantitative evidence about the performance of existing tools. We formulate our vision for holistic performance analysis of ML workflows along four design pillars: a unified execution model, lightweight collection of performance data, efficient data aggregation and presentation, and close integration into ML systems. Finally, we propose first steps towards implementing our vision as GradeML, a holistic performance-analysis tool for ML workflows. Our preliminary work and experiments are open source at https://github.com/atlarge-research/grademl.
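The first three pillars are easiest to picture with a small example. The Python sketch below is hypothetical (the `WorkflowProfiler` class and its phase names are ours, not part of GradeML): it treats an ML workflow as a sequence of named phases under a unified model, collects only aggregate per-phase counters to keep overhead low, and presents an aggregated breakdown of where workflow time goes.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class WorkflowProfiler:
    """Hypothetical sketch: records wall-clock time per workflow phase,
    keeping collection lightweight by storing aggregate counters rather
    than full traces."""

    def __init__(self):
        self.totals = defaultdict(float)  # phase name -> total seconds
        self.counts = defaultdict(int)    # phase name -> invocation count

    @contextmanager
    def phase(self, name):
        # Unified execution model: every workflow step, whether data
        # preprocessing or training, is just a named phase.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def report(self):
        # Aggregation and presentation: each phase's share of total time.
        total = sum(self.totals.values()) or 1.0
        return {
            name: {"seconds": t, "share": t / total, "calls": self.counts[name]}
            for name, t in self.totals.items()
        }

profiler = WorkflowProfiler()
with profiler.phase("data-preprocessing"):
    time.sleep(0.01)  # stand-in for data cleaning/validation work
with profiler.phase("training"):
    time.sleep(0.02)  # stand-in for a model-training step
print(profiler.report())
```

A real tool would additionally attach resource metrics (GPU utilization, I/O) to each phase and correlate phases across distributed workers; this sketch only illustrates the shape of the per-phase, workflow-level view the pillars call for.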
GradeML: Towards Holistic Performance Analysis for Machine Learning Workflows