ABSTRACT
Today, machine learning (ML) workloads are nearly ubiquitous. Over the past decade, much effort has been put into making ML model training fast and efficient, e.g., by developing new ML frameworks (such as TensorFlow and PyTorch), leveraging hardware support (TPUs, GPUs, FPGAs), and implementing new execution models (pipelined and distributed training). Matching this trend, considerable effort has also gone into performance-analysis tools focused on ML model training. However, as we identify in this work, ML model training rarely happens in isolation and is instead one step in a larger ML workflow. It is therefore surprising that no performance-analysis tool covers the entire lifecycle of ML workflows. Addressing this large conceptual gap, we envision in this work a holistic performance-analysis tool for ML workflows. We analyze the state-of-practice and the state-of-the-art, presenting quantitative evidence about the performance of existing tools. We formulate our vision for holistic performance analysis of ML workflows along four design pillars: a unified execution model, lightweight collection of performance data, efficient data aggregation and presentation, and close integration into ML systems. Finally, we propose first steps towards implementing our vision as GradeML, a holistic performance-analysis tool for ML workflows. Our preliminary work and experiments are open source at https://github.com/atlarge-research/grademl.
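The first three pillars are easiest to picture with a small example. The Python sketch below is hypothetical (the `WorkflowProfiler` class and its phase names are ours, not part of GradeML): it treats an ML workflow as a sequence of named phases under a unified model, collects only aggregate per-phase counters to keep overhead low, and presents an aggregated breakdown of where workflow time goes.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class WorkflowProfiler:
    """Hypothetical sketch: records wall-clock time per workflow phase,
    keeping collection lightweight by storing aggregate counters rather
    than full traces."""

    def __init__(self):
        self.totals = defaultdict(float)  # phase name -> total seconds
        self.counts = defaultdict(int)    # phase name -> invocation count

    @contextmanager
    def phase(self, name):
        # Unified execution model: every workflow step, whether data
        # preprocessing or training, is just a named phase.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def report(self):
        # Aggregation and presentation: each phase's share of total time.
        total = sum(self.totals.values()) or 1.0
        return {
            name: {"seconds": t, "share": t / total, "calls": self.counts[name]}
            for name, t in self.totals.items()
        }

profiler = WorkflowProfiler()
with profiler.phase("data-preprocessing"):
    time.sleep(0.01)  # stand-in for data cleaning/validation work
with profiler.phase("training"):
    time.sleep(0.02)  # stand-in for a model-training step
print(profiler.report())
```

A real tool would additionally attach resource metrics (GPU utilization, I/O) to each phase and correlate phases across distributed workers; this sketch only illustrates the shape of the per-phase, workflow-level view the pillars call for.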
GradeML: Towards Holistic Performance Analysis for Machine Learning Workflows