Job placement advisor based on turnaround predictions for HPC hybrid clouds
Introduction
Cloud computing has become an essential platform for several applications and services, including those with High Performance Computing (HPC) requirements. A clear trend is the use of hybrid clouds comprising on-premise and remote resources. During peak demand, jobs are submitted to the cloud rather than to on-premise clusters, whose queue waiting times can be long compared to cloud resource provisioning times.
Current work on HPC cloud, also known as HPCaaS (HPC as a Service) [1], has mainly focused on understanding the cost–benefit trade-offs of using the cloud over on-premise clusters [2], [3], [4], [5], [6], [7], [8], and on evaluating the performance gap between cloud and on-premise resources [9], [10], [11], [12]. Even though the cloud typically has slower internal networks than on-premise resources, bursting jobs to the cloud can still provide better overall performance in overloaded environments. Nevertheless, users still struggle to decide where to run their jobs at any given moment due to several factors. Supporting users in this decision is the objective of this work.
Three factors are crucial for effective job placement in HPC cloud environments [13]: job queue waiting time, execution time, and the relative performance of the cloud compared to that of on-premise machines. If users knew how long their jobs would wait in the local queue and how long they would take to run in both environments, they could obtain the optimal turnaround time. However, this information is not known in advance.
A common strategy is to estimate the waiting time and execution time using historical data. Nevertheless, such estimates are not always accurate: existing prediction methods make mistakes and cannot always be trusted. In this paper, we propose an advisor: a tool that considers the uncertainty of the predictions to select the environment, either cloud or on-premise, to which the user should submit her jobs. Based on a cloud versus on-premise performance ratio, this tool computes a turnaround time estimate for both environments along with a measure of uncertainty. Only if the uncertainty is below a threshold is the user advised to run in the environment with the shortest estimated turnaround time; otherwise, the advisor plays it safe by instructing the user to send the job to local resources.
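The placement rule described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the single uncertainty value, and the assumption that cloud provisioning adds negligible waiting time are ours.

```python
# Hypothetical sketch of the advisor's placement rule: compare estimated
# turnaround times, but fall back to local resources when uncertain.

def advise(wait_est, run_est_local, speed_ratio, uncertainty, threshold):
    """Return 'cloud' or 'on-premise' for a job submission.

    wait_est      -- predicted queue waiting time on the local cluster (s)
    run_est_local -- predicted execution time on on-premise resources (s)
    speed_ratio   -- cloud runtime divided by on-premise runtime
    uncertainty   -- spread of the underlying predictions (illustrative)
    threshold     -- cutoff above which the advisor plays it safe
    """
    if uncertainty > threshold:
        return "on-premise"  # low confidence: default to local resources
    turnaround_local = wait_est + run_est_local
    # Assumes cloud provisioning time is negligible next to local queueing
    turnaround_cloud = run_est_local * speed_ratio
    return "cloud" if turnaround_cloud < turnaround_local else "on-premise"

print(advise(wait_est=3600, run_est_local=1800, speed_ratio=1.5,
             uncertainty=0.2, threshold=0.5))  # → cloud
```

With a confident prediction (uncertainty below the cutoff), the job bursts to the cloud whenever the slowdown from the speed ratio is outweighed by the avoided local queue wait.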
The advisor processes historical logs of queueing systems to extract prediction labels (waiting and execution times) and features. Features can be fields from the original logs, such as submission time, requested time, and requested number of processors, or they can be derived from the queue state, e.g. queue size, processor occupancy, and queued work. Furthermore, scheduling promises are added to the mix of features and are shown to substantially improve prediction accuracy.
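As an illustration of the derived queue-state features mentioned above, the following sketch computes them at a job's submission time. The field names and function signature are assumptions for the example, not the paper's actual schema.

```python
# Illustrative derivation of queue-state features from a workload log.

def queue_features(queued_jobs, total_procs, busy_procs):
    """Compute queue-state features at a job's submission time.

    queued_jobs -- list of (requested_procs, requested_time) tuples for
                   the jobs currently waiting in the queue
    total_procs -- total processors in the cluster
    busy_procs  -- processors currently allocated to running jobs
    """
    return {
        # number of jobs waiting ahead of (or alongside) the new job
        "queue_size": len(queued_jobs),
        # fraction of the machine currently in use
        "processor_occupancy": busy_procs / total_procs,
        # queued work: total processor-seconds requested by waiting jobs
        "queued_work": sum(p * t for p, t in queued_jobs),
    }

feats = queue_features([(16, 3600), (8, 7200)], total_procs=128, busy_procs=96)
print(feats["queued_work"])  # → 115200
```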
The prediction is based on an Instance-Based Learning (IBL) algorithm [14] (a machine learning method) that relates a new incoming job to similar jobs in the history (data-based predictions). The predicted waiting time, execution time, and uncertainty are computed as a function of the labels of similar jobs. These predictions are then combined in the advisor to make allocation decisions. We evaluated the advisor with traces of real job queues and show its benefits under different performance ratios, using “saved-time” (Section 5.1) as the main evaluation metric.
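A minimal instance-based predictor in the spirit of IBL [14] can be sketched as below: the estimate is a function of the labels of the most similar historical jobs, and the spread of those labels serves as the uncertainty measure. The distance function, the choice of k, the median as the aggregate, and the standard deviation as the uncertainty are all illustrative choices, not the paper's exact configuration.

```python
import statistics

def ibl_predict(history, query, k=3):
    """Instance-based prediction of a job label (e.g. wait time).

    history -- list of (feature_vector, label) pairs from past jobs
    query   -- feature vector of the new incoming job
    Returns (prediction, uncertainty): the median label of the k nearest
    neighbours and the spread (population stdev) of those labels.
    """
    def dist(a, b):
        # Euclidean distance over numeric job features
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    neighbours = sorted(history, key=lambda item: dist(item[0], query))[:k]
    labels = [label for _, label in neighbours]
    return statistics.median(labels), statistics.pstdev(labels)

# Features: (requested_procs, requested_time); label: observed wait time (s)
history = [((16, 3600), 600), ((16, 3500), 660),
           ((64, 7200), 5400), ((8, 1800), 300)]
pred, unc = ibl_predict(history, (16, 3600))
print(round(pred))  # → 600
```

A large spread among the neighbours' labels signals that the history contains no consistent precedent for the job, which is exactly the situation in which the advisor's cutoff defaults to local resources.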
The novelty of our work lies in a detailed study and resource management techniques that advance the state of the art of HPC cloud. In summary, our main contributions are:
- A decision-support tool based on runtime and wait time predictions for HPC hybrid cloud environments (Section 4);
- A cutoff function for conservative resource allocation decisions that considers the uncertainty of execution and wait time predictions (Section 4, Section 5);
- An evaluation of the advisor using traces from real supercomputing workloads, with lessons learned on the management of prediction uncertainty. We also evaluated the advisor using real speedup curves from applications executed in six environments: three on-premise and three cloud-based (Section 5);
- A machine-learning enhanced predictor that exploits scheduling information as a feature (Section 5);
- An analysis of the impact of features on machine-learning-based predictions of job execution and wait times (Section 5).
Section snippets
Related work
Research efforts related to our work fall into three major areas: metascheduling, job waiting time prediction, and runtime prediction. Solutions from these areas originate in cluster and grid computing but can be applied to HPC cloud.
Problem description
Due to the heterogeneity of jobs in several supercomputing settings, mixing on-premise and cloud resources is a natural way to get the best of the two environments. In HPC hybrid clouds, users can experience fast interconnections in on-premise clusters and quick access to resources in the cloud. Hybrid clouds are also cost-effective since it is possible to keep a certain amount of on-premise resources and rent
Advisor tool: overview and method
The focus of this work is the development and evaluation of a decision-support tool, the advisor, for hybrid clouds, built using prediction techniques from the literature [25]. These techniques are based on IBL [14] services that provide two types of predictions: wait time estimates (how long a job is expected to wait in a local resource manager queue before execution) and runtime estimates (how long a job is going to run once it starts execution).
The advisor acts as a traditional
Evaluation
In this section, we evaluate the advisor’s effectiveness in helping a user place jobs in HPC hybrid clouds. First, we define the execution environment, workloads, and metrics used, along with a theoretical model of the relative speed between the cloud and the local environment, which we call the speed ratio. Furthermore, we define real speed ratios obtained from the HPC cloud literature. Then, we perform a series of analyses: (i) how the cutoff handles inaccurate predictions; (ii) the impact of using
Conclusion
We proposed an advisor to help users decide where to run jobs in HPC hybrid cloud environments. The decision relies on the queue waiting time and execution time of the jobs, which are predicted using traces of past job scheduling data. Even though we have used state-of-the-art predictors, the estimates of wait time and runtime can be inaccurate.
One of our contributions is to define a cutoff function of the uncertainty, limiting the impact of prediction errors when using cloud resources. This cutoff
Acknowledgments
We thank Warren Smith for providing anonymized data used to validate our estimator implementation. We thank Carlos Cardonha, Matthias Kormaksson, and Sam Sanjabi for reviewing early drafts of this work, and the anonymous reviewers. We would also like to thank Dror G. Feitelson for maintaining the Parallel Workloads Archive, and all organizations and researchers who made their workload logs available. This work has been partially supported by FINEP/MCTI under grant no. 03.14.0062.00.
References (34)
- et al., Cost minimization for computational applications on hybrid cloud infrastructures, Future Gener. Comput. Syst. (2013)
- et al., Experience with using the parallel workloads archive, J. Parallel Distrib. Comput. (2014)
- et al., A data placement strategy in scientific cloud workflows, Future Gener. Comput. Syst. (2010)
- et al., Enabling high-performance computing as a service, IEEE Comput. (2012)
- S. Ostermann, A. Iosup, N. Yigitbasi, R. Prodan, T. Fahringer, D. Epema, A performance analysis of EC2 cloud computing...
- J. Napper, P. Bientinesi, Can cloud computing reach the top500?, in: Proceedings of the Combined Workshops on...
- C. Vecchiola, S. Pandey, R. Buyya, High-performance cloud computing: A view of scientific applications, in: Proceedings...
- A. Gupta, L.V. Kale, F. Gioachin, V. March, C.H. Suen, B.-S. Lee, P. Faraboschi, R. Kaufmann, D. Milojicic, The who,...
- E. Roloff, M. Diener, A. Carissimi, P.O.A. Navaux, High performance computing in the cloud: Deployment, performance and...
- et al., Enabling on-demand science via cloud computing, IEEE Cloud Comput. (2014)
- Understanding the performance and potential of cloud computing for scientific applications, IEEE Trans. Cloud Comput.
- Deciding when and how to move HPC jobs to the cloud, IEEE Comput.
- Instance-based learning algorithms, Mach. Learn.
Renato L.F. Cunha is a researcher at the Industrial Cloud Technologies Group at IBM Research, Brazil. He obtained a M.Sc. degree in Computer Science at the Federal University of Minas Gerais, and a B.Sc. degree in Computing Engineering at the Federal University of Espirito Santo, Brazil. He has worked on high-performance computing, development of dropline GNOME for Slackware Linux, and has been involved in several distributed-system projects. Currently, he works on techniques for resource allocation in HPC cloud environments.
Eduardo R. Rodrigues is a researcher at the Industrial Cloud Technologies Group at IBM Research, Brazil, and his area of expertise is High Performance Computing. Prior to joining IBM, he worked for the Center for Weather Forecast and Climate Studies (CPTEC/INPE) as a research associate, where he developed strategies to scale weather and climate models for large parallel machines. He earned his Ph.D. in Computer Science (2011) from the Federal University of Rio Grande do Sul (UFRGS). During his Ph.D., he worked as a visiting scholar at the University of Illinois at Urbana–Champaign. Eduardo also holds a M. Sc. in Applied Computing from the Brazilian Institute for Space Research (INPE) and a B.Sc. in Computer Information Systems from Bahia State University (Uneb).
Leonardo P. Tizzei is a researcher in the Natural Resources group at IBM Research, Brazil. He is currently engaged in projects on software platforms for systems integration and operation, with a focus on natural-resource industries. Prior to joining IBM Research Brazil, he was a postdoctoral researcher at the Institute of Computing at the University of Campinas (Brazil). He earned a bachelor’s degree (2003), a master’s degree (2007), and a Ph.D. (2012) in Computer Science from the University of Campinas. During his Ph.D., he was a visiting researcher at the School of Computing and Communications, Lancaster University (UK) (2009–2010). Leonardo’s research interests lie in the fields of software reuse and maintenance.
Marco A.S. Netto has over 15 years of experience in resource management for distributed systems. He works mainly on Cloud Computing and HPC related topics. He is currently the manager of the Industrial Cloud Technologies Group at IBM Research Brazil and an IBM Master Inventor. Marco has published over 40 scientific publications, including journal articles, conference papers, and book chapters, and has filed over 40 patents. Marco obtained his Ph.D. in Computer Science at the University of Melbourne (2010), Australia.